The Plan

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a high-level overview of the reasoning behind my research priorities, written as a Q&A.

What’s your plan for AI alignment?

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

That sounds… awfully optimistic. Do you actually think that’s viable?

Better than a 50/50 chance of working in time.

Do you just have really long timelines?

No. My median is maybe 10-15 years, though that’s more a gut estimate based on how surprised I was over the past decade rather than a carefully-considered analysis. (I wouldn’t be shocked by another AI winter, especially on an inside view, but on an outside view the models generating that prediction have lost an awful lot of Bayes Points over the past few years.)

Mostly timelines just aren’t that relevant; they’d have to get down to around 18-24 months before I think it’s time to shift strategy a lot.

… Wat. Not relevant until we’re down to two years?!?

To be clear, I don’t expect to solve the whole problem in the next two years. Rather, I expect that even the incremental gains from partial progress on fundamental understanding will be worth far more than marginal time/effort on anything else, at least given our current state.

At this point, I think we’re mostly just fundamentally confused about agency and alignment. I expect approximately-all of the gains-to-be-had come from becoming less confused. So the optimal strategy is basically to spend as much time as possible sorting out as much of that general confusion as possible, and if the timer starts to run out, then slap something together based on the best understanding we have.

18-24 months is about how long I expect it to take to slap something together based on the best understanding we have. (Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

But iterative engineering is important!

In order for iterative engineering to be useful, we first need to have a strong enough understanding of what we even want to achieve in order to recognize when an iteration has brought us a step closer to the goal. No amount of A/B testing changes to our website will make our company profitable if we’re measuring the wrong metrics. I claim that, for alignment, we do not yet have a strong enough understanding for iteration to produce meaningful progress.

When I say “we’re just fundamentally confused about agency and alignment”, that’s the sort of thing I’m talking about.

To be clear: we can absolutely come up with proxy measures of alignment. The problem is that I don’t expect iteration under those proxy measures to get us meaningfully closer to aligned AGI. No reasonable amount of iterating on gliders’ flight-range will get one to the moon.

But engineering is important for advancing understanding too!

I do still expect some amount of engineering to be central for making progress on fundamental confusion. Engineering is one of the major drivers of science; failed attempts to build amplifiers drove our first decent understanding of semiconductors, for instance. But this is a very different path-to-impact than directly iterating on “alignment”, and it makes sense to optimize our efforts differently if the path-to-impact is through fundamental understanding. Just take some confusing concept which is fundamental to agency and alignment (like abstraction, or optimization, or knowledge, or …) and try to engineer anything which can robustly do something with that concept. For instance, a lot of my own work is driven by the vision of a “thermometer of abstraction”, a device capable of robustly and generalizably measuring abstractions and presenting them in a standard legible format. It’s not about directly iterating on some alignment scheme, it’s about an engineering goal which drives and grounds the theorizing and can be independently useful for something of value.

Also, the theory-practice gap is a thing, and I generally expect the majority of “understanding” work to go into crossing that gap. I consider such work a fundamental part of sorting out confusions; if the theory doesn’t work in practice, then we’re still confused. But I also expect that the theory-practice gap is only very hard to cross the first few times; once a few applications work, it gets much easier. Once the first field-effect transistor works, it’s a lot easier to come up with more neat solid-state devices, without needing to further update the theory much. That’s why it makes sense to consider the theory-practice gap a part of fundamental understanding in its own right: once we understand it well enough for a few applications, we usually understand it well enough to implement many more with much lower marginal effort.

An analogy: to go from medieval castles to skyscrapers, we don’t just iterate on stone towers; we leverage fundamental scientific advances in both materials and structural engineering. My strategy for building the tallest possible metaphorical skyscraper is to put all my effort into fundamental materials and structural science. That includes testing out structures as-needed to check that the theory actually works, but the goal there is understanding, not just making tall test-towers; tall towers might provide useful data, but they’re probably not the most useful investment until we’re near the end-goal. Most of the iteration is on e.g. metallurgy, not on tower-height directly. Most of the experimentation is on e.g. column or beam loading under controlled conditions, again not on tower-height directly. If the deadline is suddenly 18-24 months, then it’s time to slap together a building with whatever understanding is available, but hopefully we figure things out fast enough that the deadline isn’t that limiting of a constraint.

What do you mean by “fundamentally confused”?

My current best explanation of “fundamental confusion” is that we don’t have the right frames. When thinking about agency or alignment, we do not know:

  • What are the most important questions to ask?
  • What approximations work?
  • What do we need to pay attention to, and what can we safely ignore?
  • How can we break the problem/system up into subproblems/subsystems?

For all of these, we can certainly make up some answers. The problem is that we don’t have answers to these questions which seem likely to generalize well. Indeed, for most current answers to these questions, I think there are strong arguments that they will not generalize well. Maybe we have an approximation which works well for a particular class of neural networks, but we wouldn’t expect it to generalize to other kinds of agenty systems (e.g. a bacterium), and it’s debatable whether it will even apply to future ML architectures. Maybe we know of some possible failure modes for alignment, but we don’t know which of them we need to pay attention to vs which will mostly sort themselves out, especially in future regimes/architectures which we currently can’t test. (Even more important: there’s only so much we can pay attention to at all, and we don’t know what details are safe to ignore.) Maybe we have a factorization of alignment which helps highlight some particular problems, but the factorization is known to be leaky; there are other problems which it obscures.

By contrast, consider putting new satellites into orbit. At this point, we generally know what the key subproblems are, what approximations we can make, what to pay attention to, what questions to ask. Most importantly, we are fairly confident that our framing for satellite delivery will generalize to new missions and applications, at least in the near-to-medium-term future. When someone needs to put a new satellite in orbit, it’s not like the whole field needs to worry about their frames failing to generalize.

(Note: there’s probably aspects of “fundamental confusion” which this explanation doesn’t capture, but I don’t have a better explanation right now.)

What are we fundamentally confused about?

We’ve already talked about one example: I think we currently do not understand alignment well enough for iterative engineering to get us meaningfully closer to solving the real problem, in the same way that iterating on glider range will not get one meaningfully closer to going to the moon. When iterating, we don’t currently know which questions to ask, we don’t know which things to pay attention to, we don’t know which subproblems are bottlenecks.

Here’s a bunch of other foundational problems/questions where I think we currently don’t know the right framing to answer them in a generalizable way:

  • Is an e-coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”?
  • What even are "human values"? What’s the type signature of human values?
  • Given two agents (with potentially completely different world models), how can I tell whether one is "trying to help" the other? What does that even mean?
  • Given a trained neural network, does it contain any subagents? What are their world-models, and what do they want?
  • Given an atomically-precise scan of a whole human brain, body, and local environment, and unlimited compute, calculate the human’s goals/wants/values, in a manner legible to an automated optimizer.
  • Given some physical system, identify any agents in it, and what they’re optimizing for.
  • Back out the learned objective of a trained neural net, and compare it to the training objective.

What kinds of “incremental progress” do you have in mind here?

As an example, I’ve spent the last couple years better understanding abstraction (and I’m currently working to push that across the theory-practice gap). It’s a necessary component for the sorts of questions I want to answer about agency in general (like those above), but in the nearer term I also expect it to provide very strong ML interpretability tools. (This is a technical thing, but if you want to see the rough idea, take a look at the Telephone Theorem post and imagine that the causal models are computational circuits for neural nets. There are still some nontrivial steps after that to adapt the theorem to neural nets, but it should convey the general idea, and it's a very simple theorem.) If I found out today that AGI was two years away, I’d probably spend a few more months making the algorithms for abstraction-extraction as efficient as I could get them, then focus mainly on applying it to interpretability.

(What I actually expect/hope is that I’ll have efficient algorithms demo-ready in the first half of next year, and then some engineers will come along and apply them to interpretability while I work on other things.)

Another example: the next major thing to sort out after abstraction will be when and why large optimized systems (e.g. neural nets or biological organisms) are so modular, and how the trained/evolved modularity corresponds to modular structures in the environment. I expect that will yield additional actionable insights into ML interpretability, and especially into what environmental/training features lead to more transparent ML models.

Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?

Mostly I think MIRI has been asking not-quite-the-right-questions, in not-quite-the-right-ways.

Not-quite-the-right-questions: when I look at MIRI’s past work on agent foundations, it’s clear that the motivating questions were about how to build AGI which satisfies various desiderata (e.g. stable values under self-modification, corrigibility, etc). Trying to understand agency-in-general was mostly secondary, and was not the primary goal guiding choice of research directions. One clear example of this is MIRI’s work on proof-based decision theories: absolutely nobody would choose this as the most-promising research direction for understanding the decision theory used by, say, an e-coli. But plenty of researchers over the years have thought about designing AGI using proof-based internals.

I’m not directly thinking about how to design an AGI with useful properties. I’m trying to understand agenty systems in general - be it humans, ML systems, e-coli, cats, organizations, markets, what have you. My impression is that MIRI’s agent foundations team has started to think more along these lines over time (especially since Embedded Agency came out), but I think they’re still carrying a lot of baggage.

… which brings us to MIRI tackling questions in not-quite-the-right-ways. The work on Tiling Agents is a central example here: the problem is to come up with models for agents which copy themselves, so copies of the agents “tile” across the environment. When I look at that problem through an “understand agency in general” lens, my immediate thought is “ah, this is a baseline model for evolution”. Once we have a good model for agents which “reproduce” (i.e. tile), we can talk about agents which approximately-reproduce with small perturbations (i.e. mutations) and the resulting evolutionary process. Then we can go look at how evolution actually behaves to empirically check our models.

When MIRI looks at the Tiling Agents problem, on the other hand, they set it up in terms of proof systems proving things about “successor” proof systems. Absolutely nobody would choose this as the most natural setup to talk about evolution. It’s a setup which is narrowly chosen for a particular kind of “agent” (i.e. AI with some provable guarantees) and a particular use-case (i.e. maintaining the guarantees when the AI self-modifies).

Main point: it does not look like MIRI has primarily been trying to sort out fundamental confusions about agency-in-general, at least not for very long; that’s not what they were optimizing for. Their work was much more narrow than that. And this is one of those cases where I expect the more-general theory to be both easier to find (because we can use lots of data from existing agenty systems in biology, economics and ML) and more useful (because it will more likely generalize to many use-cases and many kinds of agenty systems).

Side note: contrary to popular perception, MIRI is an extremely heterogeneous org, and the criticisms above apply to different people at different times to very different degrees. That said, I think it’s a reasonable representation of the median past work done at MIRI. Also, MIRI is still the best org at this sort of thing, which is why I’m criticizing them in particular.

What’s the roadmap?

Abstraction is the main foundational piece (more on that below). After that, the next big piece will be selection theorems, and I expect to ride that train most of the way to the destination.

Regarding selection theorems: I think most of the gap between aspects of agency which we understand in theory, and aspects of agenty systems which seem to occur consistently in practice, comes from broad and robust optima. Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are “broad”: optima whose basins fill a lot of parameter/genome space. And they find optima which are robust: small changes in the distribution of the environment don’t break them. There are informal arguments that this leads to a lot of key properties:

  • Modularity of the trained/evolved system (which we do indeed see in practice)
  • Good generalization properties
  • Information compression
  • Goal-directedness

… but we don’t have good formalizations of those arguments, and we’ll need the formalizations in order to properly leverage these properties for engineering.

Besides that, there’s also some cruft to clean up in existing theorems around agency. For instance, coherence theorems (i.e. the justifications for Bayesian expected utility maximization) have significant shortcomings, and are incomplete in important ways. And of course there’s also work to be done on the theoretical support structure for all this - for instance, sorting out good models of what optimization even means.

Why do we need formalizations for engineering?

It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.

Let’s get a bit more concrete with the modularity example. We could try to build some non-gears-level (i.e. black-box) model of modularity in neural networks by training some different architectures in different regimes on different tasks and with different parameters, empirically computing some proxy measure of “modularity” for each trained network, and then fitting a curve to it. This will probably work great right up until somebody tries something well outside of the distribution on which this black-box model was fit. (Those crazy engineers are constantly pushing the damn boundaries; that’s largely why they’re so useful for driving fundamental understanding efforts.)
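To make the black-box approach concrete, a proxy measure of “modularity” might look something like the sketch below. Everything here is a hypothetical choice for illustration: the proxy is Newman’s graph modularity applied to a synthetic “weight matrix” with a planted two-block structure, and the partition is assumed known rather than discovered.

```python
import numpy as np

def modularity(adj, labels):
    """Newman's modularity Q for a weighted undirected graph and a given partition."""
    k = adj.sum(axis=1)            # weighted node degrees
    two_m = adj.sum()              # total weight, counted in both directions
    same = labels[:, None] == labels[None, :]
    return ((adj - np.outer(k, k) / two_m) * same).sum() / two_m

# Synthetic "trained layer": weak background weights plus two strong diagonal blocks.
rng = np.random.default_rng(0)
n = 40
w = rng.normal(size=(n, n)) * 0.1
w[:20, :20] += rng.normal(size=(20, 20))   # strong intra-module connections
w[20:, 20:] += rng.normal(size=(20, 20))

adj = np.abs(w) + np.abs(w).T              # symmetrize into an undirected graph
np.fill_diagonal(adj, 0.0)

labels = np.array([0] * 20 + [1] * 20)     # candidate module assignment
print(f"modularity proxy Q = {modularity(adj, labels):.3f}")
```

The black-box program would then fit a curve of such Q-values across architectures and training regimes, which is exactly the step that fails to generalize off-distribution.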

On the other hand, if we understand why modularity occurs in trained/evolved systems, then we can follow the gears of our reasoning even on new kinds of systems. More importantly, we can design new systems to leverage those gears without having to guess and check.

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Why so much focus on abstraction?

Abstraction is a common bottleneck to a whole bunch of problems in agency and alignment. Questions like:

  • If I have some system, what’s the right way to carve out a subsystem (which might be an “agent”, or a “world model”, or an “optimizer”, etc)? This should be robust/general enough to let us confidently say things like e.g. “there are no agents embedded in this trained neural net”.
  • What kinds-of-things show up in world models? For instance, is an AI likely to have internal notions of “tree” or “rock” or “car” which map to the corresponding human notions, and how closely?
  • How can we empirically measure high-level abstract things (like trees or agents) in the real world, in robustly generalizable ways?
  • To the extent that humans care about high-level abstract things like trees or cars, rather than quantum fields, how can we formalize that?
  • How can we translate the internal concepts used by trained ML systems into human-legible concepts, robustly enough that we won’t miss anything important (or at least can tell if we do)?

… and so forth. The important point isn’t any one of these questions; the important point is that understanding abstraction is a blocker for a whole bunch of different things. That’s what makes it an ideal target to focus on. Once it’s worked out, I expect to be unblocked not just on the above questions, but also on other important questions I haven’t even thought of yet - if it’s a blocker for many things already, it’s probably also a blocker for other things which I haven’t noticed.

If I had to pick one central reason why abstraction matters so much, it’s that we don’t currently have a robust, generalizable and legible way to measure high-level abstractions. Once we can do that, it will open up a lot of tricky conceptual questions to empirical investigation, in the same way that robust, generalizable and legible measurement tools usually open up scientific investigation of new conceptual areas.

But, like, 10-15 years?!?

A crucial load-bearing part of my model here is that agency/alignment work will undergo a phase transition in the next ~5 years. We’ll go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset. Or at the very least I expect to have a workable paradigm; whether anyone else jumps on board is a more open question.

There’s more than one possible path here, more than one possible future paradigm. My estimate of “~5 years” comes from eyeballing the current rate of progress, plus a gut feel for how close the frames are to where they need to be for progress to take off.

As an example of one path which I currently consider reasonably likely: abstraction provides the key tool for the phase transition. Once we can take a simulated environment or a trained model or the like, and efficiently extract all the natural abstractions from it, that changes everything. It’ll be like introducing the thermometer to the study of thermodynamics. We’ll be able to directly, empirically answer questions like “does this model know what a tree is?” or “does this model have a notion of human values?” or “is ‘human’ a natural abstraction?” or “are the agenty things in this simulation natural abstractions?” or …. (These won’t be yes/no answers, but they’ll be quantifiable in a standardized and robustly-generalizable way.) This isn’t a possibility I expect to be legibly plausible to other people right now, but it’s one I’m working towards.

Another path: once a few big selection theorems are sorted out (like modularity of evolved systems, for instance) and empirically verified, we’ll have a new class of tools for empirical study of agenty systems. Like abstraction measurement, this has the potential to open up a whole class of tricky conceptual questions to empirical investigation. Things like “what is this bacterium’s world model?” or “are there any subagents in this trained neural network?”. Again, I don’t necessarily expect this possibility to be legibly plausible to other people right now.

To be clear: not all of my “better than 50/50 chance of working in time” comes from just these two paths. I’ve sketched a fair amount of burdensome detail here, and there’s a lot of variations which lead to similar outcomes with different details, as well as entirely different paths. But the general theme is that I don’t think it will take too much longer to get to a point where we can start empirically investigating key questions in robustly-generalizable ways (rather than the ad-hoc methods used for empirical work today), and get proper feedback loops going for improving understanding.

Why ambitious value learning?

It’s the best-case outcome. I mean, c’mon, it’s got “ambitious” right there in the name.

… but why not aim for some easier strategy?

The main possibly-easier strategy for which I don’t know of any probably-fatal failure mode is to emulate/simulate humans working on the alignment problem for a long time, i.e. a Simulated Long Reflection. The main selling point of this strategy is that, assuming the emulation/simulation is accurate, it probably performs at least as well as we would actually do if we tackled the problem directly.

This is really a whole class of strategies, with many variations, most of which involve training ML systems to mimic humans. (Yes, that implies we’re already at the point where it can probably FOOM.) In general, the further the variations get from just directly simulating humans working on alignment basically the way we do now (but for longer), the more possibly-fatal failure modes show up. HCH is a central example here: for some reason a structure whose most obvious name is The Infinite Bureaucracy was originally suggested as an approximation of a Long Reflection. Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons (except even worse, because it’s infinite).

… but the failure of variations does not necessarily mean that the basic idea is doomed. The basic idea seems basically-sound to me; the problem is implementing it in such a way that the output accurately mimics a real long reflection, while also making it happen before unfriendly AGI kills us all.

Personally, I’m still not working on that strategy, for a few main reasons:

  • I expect my current strategy to be more competitive. One big advantage of understanding agency in general is that we can apply that understanding to whatever ML/AI progress comes along, even if it ends up looking very different from e.g. GPT-3.
  • The Simulated Long Reflection strategy gets more likely to work when we have people for it to mimic who are already far down the road to solving alignment. The further, the better.
  • On a gut level, I just don’t expect ML to emulate humans accurately enough for a Simulated Long Reflection to work until we’ve already passed doomsday. (This is probably the cruxiest issue.)

I am generally happy that other people are working on strategies in the Simulated Long Reflection family, and hope that such work continues.

Comments

I want to disagree about MIRI. 

Mostly, I think that MIRI (or at least a significant subset of MIRI) has always been primarily directed at agenty systems in general.

I want to separate agent foundations at MIRI into three eras: the Eliezer Era (2001-2013), the Benya Era (2014-2016), and the Scott Era (2017-).

The transitions between eras had an almost complete overhaul of the people involved. In spite of this, I believe that they have roughly all been directed at the same thing, and that John is directed at the same thing.

The proposed mechanism behind the similarity is not transfer, but instead because agency in general is a convergent/natural topic.

I think throughout time, there has always been a bias in the pipeline from ideas to papers towards being more about AI. I think this bias has gotten smaller over time, as the agent foundations research program both started having stable funding, and started carrying less and less of the weight of all of AI alignment on its back. (Before going through editing with Rob, I believe Embedded Agency had no mention of AI at all.)

I believe that John thinks that the Embedded Agency document is especially close to his agenda, so I will sta... (read more)

I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI's agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI's history, elucidating that general theory has not been the primary optimization objective.

Let's go through some examples.

The Sequences: we can definitely see Eliezer's understanding of the general theory of agency in many places, especially when talking about Bayes and utility. (Engines of Cognition is a central example.) But most of the sequences talk about things like failure modes of human cognition, how to actually change your mind, social failure modes of human cognition, etc. It sure looks like the primary optimization objective is about better human thinking, plus some general philosophical foundations, not the elucidation of the general theory of agency.

Tiling agents and proof-based decision theories: I'm on board with the use of proof-based setups to make minimal assumptions about "the substrate that the agency is made of". That's an entirely reasonable choice, and it does look like t... (read more)

Hmm, yeah, we might disagree about how much reflection (self-reference) is a central part of agency in general.

It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of "reflective agents."

Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.

Does this seem right to you?

To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI's lack of progress at least as much as you would if MIRI was doing the same thing as you, but for half as long.

Which isn't *that* large an update. The average number of agent foundations researchers (That are public facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.

Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality-adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is non-trivially closer to John today than Agent Foundations 5 years ago is to John today.)

It seems accurate to me to say that Agent Foundations in 2014 was more focused on reflection, which shifted towards embeddedness, and then shifted towards abstraction, and that these things all flow together in my head, and so Scott thinking about abstraction will have more reflection mixed in than John thinking about abstraction. (Indeed, I think progress on abstraction would have huge consequences on how we think about reflection.)

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred from the fact that I am arguing that it is similar to my own.) I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't expect disagreements about what to do next to have a strong impact on how to do the first step.

This all sounds right. In particular, for folks reading, I symmetrically agree with this part: ... i.e. I endorse Scott's research program, mine is indeed similar, I wouldn't be the least bit surprised if we disagree about what comes next but we're pretty aligned on what to do now. Also, I realize now that I didn't emphasize it in the OP, but a large chunk of my "50/50 chance of success" comes from other peoples' work playing a central role, and the agent foundations team at MIRI is obviously at the top of the list of people whose work is likely to fit that bill. (There's also the whole topic of producing more such people, which I didn't talk about in the OP at all, but I'm tentatively optimistic on that front too.)
That does seem right. I do expect reflection to be a pretty central part of the path to FOOM, but I expect it to be way easier to analyze once the non-reflective foundations of agency are sorted out. There are good reasons to expect otherwise on an outside view - i.e. all the various impossibility results in logic and computing. On the other hand, my inside view says it will make more sense once we understand e.g. how abstraction produces maps smaller than the territory while still allowing robust reasoning, how counterfactuals naturally pop out of such abstractions, how that all leads to something conceptually like a Cartesian boundary, the relationship between abstract "agent" and the physical parts which comprise the agent, etc. If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.
One thing I'm still not clear about in this thread is whether you (John) would feel that progress has been made on the theory of agency if all the problems on which MIRI has worked were instantaneously solved. Because there's a difference between saying "this is the obvious first step if you believe reflection is the taut constraint" and "solving this problem would help significantly even if reflection wasn't the taut constraint".
I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I'd expect that a lot of general theory of agency came along with it. But if a "solution" to something like e.g. the Tiling Problem didn't come with a bunch of progress on more foundational general theory of agency, then I'd be very suspicious of that supposed solution, and I'd expect lots of problems to crop up when we try to apply the solution in practice. (And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)
Scott Garrabrant
I agree with this asymmetry. One thing I am confused about is whether to think of the e-coli as qualitatively different from the human. The e-coli is taking actions that can be well modeled by an optimization process searching for actions that would be good if this optimization process output them, which has some reflection in it. It feels like the e-coli can behaviorally be well modeled this way, but is mechanistically not shaped like this. I feel like the mechanistic fact is more important, but we are much closer to having behavioral definitions of agency than mechanistic ones.
I would say the e-coli's fitness function has some kind of reflection baked into it, as does a human's fitness function. The qualitative difference between the two is that a human's own world model also has an explicit self-model in it, which is separate from the reflection baked into a human's fitness function. After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game. ... so yeah, I'm on basically the same page as you here.
Main response is in another comment; this is a tangential comment about prescriptive vs descriptive viewpoints on agency.

I think viewing agency as "the pipeline from the prescriptive to the descriptive" systematically misses a lot of key pieces. One central example of this: any properties of (inner/mesa) agents which stem from broad optima, rather than merely optima. (For instance, I expect that modularity of trained/evolved systems mostly comes from broad optima.) Such properties are not prescriptive principles; a narrow optimum is still an optimum. Yet we should expect such properties to apply to agenty systems in practice, including humans, other organisms, and trained ML systems. The Kelly criterion is another good example: Abram has argued that it's not a prescriptive principle, but it is still a very strong descriptive principle for agents in suitable environments.

More importantly, I think starting from prescriptive principles makes it much easier to miss a bunch of the key foundational questions - for instance, things like "what is an optimizer?" or "what are goals?". Questions like these need some kind of answer in order for many prescriptive principles to make sense in the first place.

Also, as far as I can tell to date, there is an asymmetry: a viewpoint starting from prescriptive principles misses key properties, but I have not seen any sign of key principles which would be missed starting from a descriptive viewpoint. (I know of philosophical arguments to the contrary, e.g. this, but I do not expect such things to cash out into any significant technical problem for agency/alignment, any more than I expect arguments about solipsism to cash out into any significant technical problem.)
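To make the descriptive reading of Kelly concrete, here is a minimal numerical sketch (my own illustration, assuming even-odds bets on a coin of bias p): searching over fixed fraction-of-bankroll policies recovers the Kelly fraction 2p − 1 as the long-run-growth maximizer, with no prescriptive axiom going in.

```python
import numpy as np

# Even-odds bet on a coin with win probability p, wagering a fixed fraction f
# of bankroll each round. Per-round expected log-growth:
#   g(f) = p*log(1 + f) + (1 - p)*log(1 - f)
# Wealth after n rounds behaves like exp(n * g(f)), so whichever f maximizes
# g(f) almost surely ends up with the most wealth -- a descriptive fact about
# which policies dominate, not a prescriptive axiom.
p = 0.6
fractions = np.linspace(0.0, 0.99, 1000)
growth = p * np.log(1 + fractions) + (1 - p) * np.log(1 - fractions)
best = fractions[np.argmax(growth)]
print(best)  # numerically recovers the Kelly fraction 2p - 1 = 0.2
```

Agents betting more or less than this fraction get outgrown in the long run, which is the sense in which Kelly is descriptive of which agents survive in such environments.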
As a long-time LW mostly-lurker, I can confirm I've always had the impression MIRI's proof-based stuff was supposed to be a spherical-cow model of agency that would lead to understanding of the messy real thing. What I think John might be getting at is that (my outsider's impression of) MIRI has been more focused on "how would I build an agent" as a lens for understanding agency in general—e.g. answering questions about the agency of e-coli is not the type of work I think of. Which maybe maps to 'prescriptive' vs. 'descriptive'?

I think you've really hit the nail on the head on what's wrong (and right) with the MIRI approach. The Cartesian Frames stuff seems to be the best stuff they've done in this direction.

I've also felt that our lack of understanding of abstraction is one of the key bottlenecks. How concerned are you about insights on this question also applying to unaligned AGI development?

How concerned are you about insights on this question also applying to unaligned AGI development?

Enough that I have considered keeping it secret, but I think keeping it public is a strong net positive relative to our current state (i.e. giant inscrutable vectors of floating-points). If there were, say, another AI winter, then I could easily imagine changing my mind about that.

I feel like your answer to "Why do we need formalizations for engineering?" just restates the claim rather than arguing for it. It sounds like you are saying "...we need formalizations because we need gears-level understanding, and formalizations are the way you get gears-level understanding in this domain." But why are formalizations the way to gears-level understanding in this domain? There are plenty of domains where one can have gears-level understanding without formalization.

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Maybe I'm just not interpreting "same rough shape" loosely enough. If pretty much any reasonable argument counts as the same rough shape as a proof, then I take back what I said.

I basically agree with this if we're viewing this post as a standalone. I only had so much space to recursively unpack things, and I figure that the claim will make more sense if people go read a few of the posts on gears-level models and then think for themselves a bit about what gears-level models look like for questions like "why does modularity show up in evolved/trained systems?". When I say "same rough shape as a proof", I don't necessarily mean any reasonable-sounding argument; the key is that we want arguments with enough precision that we can map out the boundaries of their necessary conditions, and enough internal structure to adapt them to particular situations or new models without having to start over from scratch. In short, it's about the ability to tell exactly when the argument applies, and to apply the argument in many ways and in many places.

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

I fear this too, at least because it's the most "yelling-at-the-people-onscreen-to-act-differently" scenario that still involves the "hard part" getting solved. I wish there was more discussion of this.

"Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons"

I guess this isn't immediately obvious to me. Bureaucracies fail because at each level the bosses tell the subordinates what to do and they just have to do it. In HCH, sure, each subordinate performs a fixed mental task, but the boss gets to consider the result and make up its own mind, taking into account the reports from the other subordinates. All this extra processing makes me feel as though it isn't exactly the same thing.

(I'm going to respond here to two different comments about HCH and why bureaucracies fail.)

I think a major reason why people are optimistic about HCH is that they're confused about why bureaucracies fail.

Responding to Chris: if you go look at real bureaucracies, it is not really the case that "at each level the bosses tell the subordinates what to do and they just have to do it". At every bureaucracy I've worked in/around, lower-level decision makers had many de facto degrees of freedom. You can think of this as a generalization of one of the central problems of jurisprudence: in practice, human "bosses" (or legislatures, in the jurisprudence case) are not able to give instructions which unambiguously specify what to do in all the crazy situations which come up in practice. Nor do people at the top have anywhere near the bandwidth needed to decide every ambiguous case themselves; there is far too much ambiguity in the world. So, in practice, lower-level people (i.e. judges at various levels) necessarily make many many judgement calls in the course of their work.

Also, in general, tons of information flows back up the hierarchy for higher-level people to make decisions. There are alr...

"At every bureaucracy I've worked in/around, lower-level decision makers had many de facto degrees of freedom." - I wasn't disputing this - just claiming that they had to work within the constraints of the higher-level boss. It's interesting to hear the rest of your model though.
Thanks for the elaboration. I agree with most/all of this. However, for a capable, well-calibrated, cautious H, it mostly seems to argue that HCH won't be efficient, not that it won't be capable and something-like-aligned. Since the HCH structure itself isn't intended to be efficient, this doesn't seem too significant to me. In particular, the bureaucracy analogy seems to miss that HCH can spend >99% of its time on robustness. (This might look more like science: many parallel teams trying different approaches, critiquing each other and failing more often than succeeding.)

I'm not sure whether you're claiming:

1. That an arbitrarily robustness-focused HCH would tend to be incorrect/overconfident/misaligned. (Where H might be a team including e.g. you, Eliezer, Paul, Wei Dai, [other people you'd want]...)
2. That any limits-to-HCH system we train would need to make a robustness/training-efficiency trade-off, and that the levels of caution/redundancy/red-teaming... required to achieve robustness would make training uncompetitive.
   - Worth noting here that this only needs to be a constant multiplier on human training time - once you're distilling or similar, there's no exponential cost increase. (Granted, distillation has its own issues.)
3. Something else.

To me (2) seems much more plausible than (1), so a perils-of-bureaucracy argument seems more reasonably aimed at IDA etc. than at HCH. I should emphasize that it's not clear to me that HCH could solve any kind of problem. I just don't see strong reasons to expect [wrong/misaligned answer] over [acknowledgement of limitations, and somewhat helpful meta-suggestions] (assuming HCH decides to answer the question).
This is a capability thing, not just an efficiency thing. If, for instance, I lack enough context to distinguish real expertise from prestigious fakery in some area, then I very likely also lack enough context to distinguish those who do have enough context from those who don't (and so on up the meta-ladder). It's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor. Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there's no error-correction mechanism for that; there's no way to ground out the top-level question anywhere other than the top-level person. Again, it's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

Orthogonal to the "some kinds of cognitive labor cannot be outsourced" problem, there's also the issue that HCH can only spend >99% of its time on robustness if the person being amplified decides to do so, and then the person being amplified needs to figure out the very difficult problem of how to make all that robustness-effort actually useful. HCH could do all sorts of things if the H in question were already superintelligent, could perfectly factor problems, knew exactly the right questions to ask, knew how to deploy lots of copies in such a way that no key pieces fell through the cracks, etc. But actual humans are not perfectly-ideal tool operators who don't miss anything or make any mistakes, and actual humans are also not super-competent managers capable of extracting highly robust performance on complex tasks from giant bureaucracies. Heck, it's a difficult and rare skill just to get robust performance on simple tasks from giant bureaucracies.

In general, if HCH requires some additional assumption that the person being amplified is smart enough to do X, then that should be baked into the whole plan from the start so that we can evaluate it properly. Like, if every time someone says "HCH has problem Y"...
For complex questions I don't think you'd have the top-level H immediately divide the question itself: you'd want to avoid this single point of failure. In unbounded HCH, one approach would be to set up a scientific community (or a set of communities...), to which the question would be forwarded unaltered. You'd have many teams taking different approaches to the question, teams distilling and critiquing the work of others, teams evaluating promising approaches... [again, in strong HCH we have pointers for all of this]. For IDA you'd do something vaguely similar, on a less grand scale. You can set up error-correction by passing pointers, explicitly asking about ambiguity/misunderstanding at every step (with parent pointers to get context), using redundancy....

I agree that H needs to be pretty capable and careful - but I'm assuming a context where H is a team formed of hand-picked humans with carefully selected tools (and access to a lot of data). It's not clear to me that such a team is going to miss required robustness/safety actions (neither is it clear to me that they won't - I just don't buy your case yet). It's not clear they're in an adversarial situation, so some fixed capability level that can see things in terms of process/meta-levels/abstraction/algorithms... may be sufficient. [Once we get into truly adversarial territory, I agree that things are harder - but there we're beyond things failing for the same reasons bureaucracies do.]

I agree it's hard to get giant bureaucracies to robustly perform simple tasks - I just don't buy the analogy. Giant bureaucracies don't have uniform values, and do need to pay for error correction mechanisms. Here I want to say: of course there's a "giant unstated list of things..." - that's why we're putting H into the system. It'd be great if we could precisely specify all the requirements on H ahead of time - but if we could do that, we probably wouldn't need H. (It certainly makes sense to specify and check for some X, b...
It sounds like roughly this is cruxy. We're trying to decide how reliable <some scheme> is at figuring out the right questions to ask in general, and not letting things slip between the cracks in general, and not overlooking unknown unknowns in general, and so forth. Simply observing <the scheme> in action does not give us a useful feedback signal on these questions, unless we already know the answers to the questions. If <the scheme> is not asking the right questions, and we don't know what the right questions are, then we can't tell it's not asking the right questions. If <the scheme> is letting things slip between the cracks, and we don't know which things to check for crack-slippage, then we can't tell it's letting things slip between the cracks. If <the scheme> is overlooking unknown unknowns, and we don't already know what the unknown unknowns are, then we can't tell it's overlooking unknown unknowns.

So: if the dream team cannot figure out beforehand all the things it needs to do to get HCH to avoid these sorts of problems, we should not expect them to figure it out with access to HCH either. Access to HCH does not provide an informative feedback signal unless we already know the answers. The cognitive labor cannot be delegated.

(Interesting side-point: we can make exactly the same argument as above about our own reasoning processes. In that case, unfortunately, we simply can't do any better; our own reasoning processes are the final line of defense. That's why a Simulated Long Reflection is special, among these sorts of buck-passing schemes: it is the one scheme which does as well as we would do anyway. As soon as we start to diverge from Simulated Long Reflection, we need to ask whether the divergence will make the scheme more likely to ask the wrong questions, let things slip between cracks, overlook unknown unknowns, etc. In general, we cannot answer this kind of question by observing the scheme itself in operation.)

(This is less cruxy, but it's a pr...

Curated. Not that many people pursue agendas to solve the whole alignment problem and of those even fewer write up their plan clearly. I really appreciate this kind of document and would love to see more like this. Shoutout to the back and forth between John and Scott Garrabrant about John's characterization of MIRI and its relation to John's work.

Mark Xu

I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails to serve in the worst case) is to let humans be epistemically competitive with the systems you're trying to train.

I mean, we have this thread with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.
Mark Xu
Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.
Once we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles. You mentioned using HCH to "let humans be epistemically competitive with the systems we're trying to train", which definitely falls in that pile. We have general principles saying that we should definitely not rely on humans being epistemically competitive with AGI; using HCH does not seem to get around those general principles at all. (Unless we buy some very strong hypotheses about humans' skill at factorizing problems, in which case we'd also expect HCH to be able to simulate something long-reflection-like.) Trying to be epistemically competitive with AGI is, in general, one of the most difficult use-cases one can aim for. For that to be easier than simulating a long reflection, even for architectures other than HCH-emulators, we'd need some really weird assumptions.

Excellent post! This seems like a highly promising and under-explored line of attack. I've had some vaguely similar thoughts over the years, but you've done a far better job articulating and developing a coherent programme. Bravo!

I think my biggest intuitive disagreement might be with whether it is likely to be possible to create some sort of efficient 'abstraction thermometer' or 'agency thermometer'. Searching for agents or abstractions in a system seems like a prototypical NP-hard search problem. Now, in practice it's often possible to solve such problems efficiently, but the setting with agents seems especially problematic in that keeping yourself obfuscated can be instrumentally useful, so I suspect the instances we're confronted with in the real world may be adversarially selected to be inscrutable to fast search methods in general.

Charlie Steiner
I'm also interested in what goes on the other side of the equation. How are you defining what to search for in the first place? If you point your abstraction detector at an AI and it outputs "this AI has a concept of trees," how do you gain confidence that the "trees" according to the AI (and according to your abstraction detector) are more or less what you mean by trees? Some ad-hoc methods spring to mind, but I'm not sure what John would say.
This is my largest concern too: that we might find principled-but-inefficient tools that give guarantees, but be unable to find any efficient approximation that doesn't lose those guarantees. However, I do think there are reasons to be cautiously optimistic, conditional on gaining a solid theoretical understanding [just my impressions: confusion entirely possible]:

1. We get to pick the structure we're searching over - the only real constraint being that it has to perform competitively. It wouldn't matter that the 'thermometers' were inefficient in 99% of cases, just so long as we were able to find at least one kind of structure combining thermometer-efficiency and performance. If the required [thermometer-friendly] property can be formally specified, it may be possible to incorporate it as a training constraint.
2. So long as we can use the tools to prevent adversarial situations from arising in the first place, we don't need to meet the bar of working in the face of super-human adversarial selection (I think it's a good idea to view getting into that situation as a presumed loss condition).
3. In principle, greater theoretical understanding may give us more than just 'thermometers' - e.g. we might hope to find operators that preserve particular agency-related safety properties. If updates could be applied in terms of such operators, that may reduce the required frequency of slower tests. [The specifics may not look like this, but a solid theoretical understanding would usually be expected to help you avoid problems in various ways, not only to test for them.]

Is an e-coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of "goal"?

That's the part I find puzzling in terms of the lack of time devoted to it: how can one talk about agency without figuring out basics like that? Though I personally argued that it might not even be possible, in this post, which conjectured that vapor bubbles "maximizing their volume" in a pot of boiling water are not qualitatively different from bacteria moving up a sugar gradient in search of food.

It's hard to articulate exactly why, but I feel like "utility-maximizing agent(s)" is not the right frame to think about AI in. You can fit a utility function to any sequence of 'actions' an 'agent' makes, so the abstraction "utility function" has no real power to predict the 'actions' of an 'agent'. There's also the fundamental human bias of ascribing agency to non-agentic systems (the weather, printers).
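The "you can fit a utility function to any action sequence" point has an almost embarrassingly direct construction. A minimal sketch (the action set and log are hypothetical, not from any real system): define a utility that rewards exactly the actions that occurred, and the fixed log becomes "utility-maximizing" by construction.

```python
# Any fixed action log whatsoever...
actions = ["left", "left", "jump", "right"]   # hypothetical, arbitrary log

# ...is perfectly rationalized by a utility that rewards what happened:
def utility(t, a):
    return 1.0 if a == actions[t] else 0.0

options = ["left", "right", "jump"]
best = [max(options, key=lambda a: utility(t, a)) for t in range(len(actions))]
print(best == actions)   # True: the log is "optimal" by construction
```

Since the construction works for every possible log, the bare claim "this system maximizes some utility function" rules nothing out, which is why it has no predictive power on its own.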

Great post. To the extent that progress can be made on this, it seems extremely important.

A question on your HCH scepticism: 

going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons

I'd be interested if you could elaborate on that. To me it seems HCH shares some elements of bureaucracy, but that there are important differences.

My thoughts:

  1. They share the property of not reliably optimising for the task they're given (HCH is best considered a sovereign, not an oracle: it's an oracle iff it wants to
...
Response here.
[comment deleted]

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia). What we want is an AGI that can robustly identify humans (and I would argue, any agentic system), determine their values in an iteratively improving way, and treat these learned values as its own. That is, we should be looking for models where goal alignment and a desire to cooperate with humanity is situated within a broad basin of attraction (like how corrigibility is supposed to work), where any misalignment that the AGI notice...

Interesting observation on the above post! Though I do not read it explicitly in John's Plan, I guess one can indeed implicitly read that John's Plan rejects routes to alignment that focus on control/myopia, routes that do not first visit step 2 of successfully solving automatic/ambitious value learning. John, can you confirm this? Background: my own default Plan does focus on control/myopia. I feel that this line of attack for solving AGI alignment (if we ever get weak or strong AGI) is reaching the stage where all the major points of 'fundamental confusion' have been solved. So for me this approach represents the true 'easier strategy'.
Jon Garcia
It's quite possible that control is easier than ambitious value learning, but I doubt that it's as sustainable. Approaches like myopia, IDA, or HCH would probably get you an AGI that is aligned to much higher levels of intelligence than doing without them, all else being equal. But if there is nothing pulling its motivations explicitly back toward a basin of value alignment, then I feel like these approaches would be prone to diverging from alignment at some level beyond where any human could tell what's going on with the system. I do think that methods of control are worthwhile to pursue over the short term, but we had better be simultaneously working on ambitious value learning in the meantime for when an ASI inevitably escapes our control anyway. Even if myopia, for instance, worked perfectly to constrain what some AGI is able to do, it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster. It would be better if AGI models, from the beginning, had at least some value learning system built in by default to act as an extra safeguard.
I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time-horizon-based myopia) which will be stable and understandable even at higher levels of AGI competence. Well, if you worry about people fiddling with control system tuning parameters, you also need to worry about someone fiddling with value learning parameters so that the AGI will only learn the values of a single group of people who would like to rule the rest of the world. Assuming that AGI is possible, I believe it is most likely that Bostrom's orthogonality hypothesis will hold for it. I am not optimistic about designing an AGI system which is inherently fiddle-proof.

This post is one of the LW posts a younger version of myself would have been most excited to read. Building on what I got from the Embedded Agency sequence, this post lays out a broad-strokes research plan for getting the alignment problem right. It points to areas of confusion, it lists questions we should be able to answer if we got this right, it explains the reasoning behind some of the specific tactics the author is pursuing, and it answers multiple common questions and objections. It leaves me with a feeling of "Yeah, I could pursue that too if I wanted, and I expect I could make some progress" which is a shockingly high bar for a purported plan to solve the alignment problem. I give this post +9.

If you are looking for a very general yet simple model of agency or at least decision making you might want to have a look at The geometry of decision-making in individuals and collectives.

While capturing known, generic features of neural integration, our model is deliberately minimal. This serves multiple purposes. First, following principles of maximum parsimony, we seek to find a simple model that can both predict and explain the observed phenomena. Second, we aim to reveal general principles and thus, consider features that are known to be valid across

...
Daniel Kokotajlo
This is big if true! I skimmed that paper and didn't understand its generality. It seems to be a model of how dumb animals and groups of dumb animals make decisions between desired places to be, as they approach a cluster of different desired places to be. The interesting upshot is that instead of picking one option as the best and heading straight for it, they make a series of binary choices. Can you perhaps help me understand -- is this supposed to generalize to humans and AGIs also? And is it supposed to generalize to choices that aren't about where to travel when travelling fast towards a cluster of desirable destinations? If so, do you think you see how, and would you be willing to explain it to me?
Happy New Year. Based on the paper, I would predict that it applies to human subconscious decision-making. I'm unsure if it applies to conscious decisions. For AI it depends on the approach chosen.
Jon Garcia
This looks really interesting. The first thought that jumped to mind was how this geometric principle might extend to abstract goal space in general. There is research suggesting that savannah-like environments may have provided human evolution ideal selective pressures for developing the cognitive tools necessary for making complex plans. Becoming adept at navigating physical scenes with obstacles, predators, refuges, and prey gave humans the right kind of brain architecture for also navigating abstract spaces full of abstract goals, anti-goals (bad outcomes to avoid), obstacles, and paths (plans).

The "geometric decision making" in the paper was studied for physical spaces, but I could imagine that animal minds (including humans) use such a bifurcation method in other goal spaces as well. In other words, agents would start out traversing state space toward the average of multiple, moderately distant goals (seeking a state from which multiple goals are still achievable), then would switch to choosing a sub-cluster of the goals to pursue once they get close enough (the binary decision / bifurcation point). This would iterate until the agent has only one easily achievable goal in front of it.

My guess is that this strategy would be safer than choosing a single goal among many at the outset of planning (e.g., the one goal with the highest expected utility upon achievement). If the situation changes while the agent is in the middle of pursuing a goal, it might find itself too far away from any other goal to make up for the sunk cost. If instead it had been pursuing some sort of multi-goal-centroid state, it could still achieve a decent alternative goal even when what would have been its first choice ceases to be an option. As it gets closer to the multi-goal-centroid, it can afford to focus on just a subset (or just a single goal), since it knows that other decent options are still nearby in state space.
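A toy sketch of that strategy (my own construction, not from the paper; the `pursue` function, step size, and drop ratio are all made up for illustration): head for the centroid of the remaining goals, prune goals that fall too far behind, and commit to the nearest goal once the centroid is reached.

```python
import numpy as np

def pursue(pos, goals, step=0.05, drop_ratio=1.5):
    """Head for the centroid of remaining goals; prune goals that fall far
    behind; commit to the nearest goal once the centroid is reached."""
    pos = np.asarray(pos, float)
    goals = [np.asarray(g, float) for g in goals]
    for _ in range(10_000):                        # safety bound
        dists = [np.linalg.norm(g - pos) for g in goals]
        if min(dists) < step:                      # a goal is reached
            return goals[int(np.argmin(dists))]
        if len(goals) > 1:                         # bifurcation: drop goals
            goals = [g for g, d in zip(goals, dists)       # much farther
                     if d <= drop_ratio * min(dists)]      # than the nearest
        target = np.mean(goals, axis=0)            # centroid of survivors
        if np.linalg.norm(target - pos) < step:    # sitting at the centroid:
            goals = [min(goals, key=lambda g: np.linalg.norm(g - pos))]
            target = goals[0]                      # commit to nearest goal
        pos = pos + step * (target - pos) / np.linalg.norm(target - pos)
    return pos

# Starting slightly above the axis of symmetry, the agent first approaches
# the midpoint of the two goals, then commits to the upper one.
print(pursue((-2.0, 0.1), [(1.0, 1.0), (1.0, -1.0)]))
```

The sunk-cost point above falls out directly: while the agent is still near the centroid, switching to the other goal costs only the small remaining detour, whereas an agent that committed at the outset would have to backtrack the whole way.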

"Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]"

I understand why you think "broad" is true. But I'm not sure I get "robust". In fact, robustness seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how "robust" an optimum is? Part of me thinks that the idea comes from how, if a system optimized for a non-robust optimum, it wouldn't internally be doing anything different, but we... (read more)

If we're just optimizing some function, then indeed breadth is the only relevant part. But for something like evolution or SGD, we're optimizing over random samples, and it's the use of many different random samples which I'd expect to select for robustness.
Maybe I misunderstand your use of "robust", but this still seems to me to be breadth. If an optimum is broader, samples are more likely to fall within it. I took "broad" to mean "has a lot of (hyper)volume in the optimization space", and "robust" to mean "stable over time/perturbation". I still contend that these optimization processes are unaware of time, or of any environmental variation, and can only select for it insofar as it is expressed as breadth.

The example I have in my head: if you had an environment, and committed to changing some aspect of it after some period of time, evolution or SGD would optimize the same as if you had committed to a different change. Which change you make would affect the robustness of the environment's optima, but the state of the environment alone determines their breadth. The processes cannot optimize based on your committed change before it happens, so they cannot optimize for robustness.

Given what you said about random samples, I think you might be working under definitions along the lines of "robust optima are ones that work in a range of environments, so you can be put in a variety of random circumstances and still have them work" and (at this point I struggled a bit to figure out what a "broad" optimum would be that's different, and this is what I came up with) "broad optima are those that you can hit approximately and still get a significant chunk of the benefit." I feel like these can still be unified into one thing, because I think approximate strategies in fixed environments are similar to fixed strategies in approximate environments: moving a little to the left is similar to the environment being a little to the right.
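The disagreement above can be made concrete with a toy picture of what sampling-based evaluation sees. Below, a 1-D loss landscape with a narrow-but-deep minimum and a broad-but-slightly-shallower one: evaluated exactly, the narrow minimum wins; evaluated under random perturbation of the parameter (the "many different random samples"), the broad one wins. The specific loss shape and noise level are my own illustrative choices, not anything from the thread:

```python
import random

def loss(x):
    # Narrow, deep basin at x = 0; broad, slightly shallower basin at x = 3.
    narrow = 50.0 * x ** 2
    broad = 0.5 * (x - 3.0) ** 2 + 0.1
    return min(narrow, broad)

def expected_loss(x, sigma=0.5, n=4000, seed=0):
    # Average loss when the parameter is hit with Gaussian noise,
    # mimicking evaluation over many random samples/perturbations.
    rng = random.Random(seed)
    return sum(loss(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n

# Exact evaluation prefers the narrow minimum: loss(0.0) < loss(3.0).
# Noisy evaluation prefers the broad one: expected_loss(3.0) < expected_loss(0.0).
```

Whether one calls what the noise selects for "robustness" or "breadth under the sampling distribution" is exactly the terminological question at issue; the sketch only shows that exact and sampled evaluation can rank the two optima oppositely.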

“It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.”

In this case the most straightforward approach would be to simply derive E. coli behaviour from basic quantum chemistry, as that is the closest field where fully deterministic, verifiable simulations are possible.

The gap between simulating hydrogen... (read more)

I think having that post on the AF would be very good. ;)

Didn't want to scare people away with the "may contain technical blah de blah" header. I'll crosspost it to AF in a few days.

In general, a great piece. One thing I found quite relatable is the point about the preparadigmatic stage of AI safety moving into later stages soon. This already seems to be happening to some degree: there are more and more projects readily available, more prosaic alignment and interpretability projects at large scale, more work being done in multiple directions, and bigger organizations having explicit safety staff and better funding in general.

Given these facts, it seems like there's bound to be a relatively big phase shift in research and action within the field, which I'm quite excited about.

Regarding modularity - you might be interested in my Motivations, Natural Selection, and Curriculum Engineering > Modularity of Capability Accumulation from last week - it has a few speculations and (probably more usefully) a couple of references you might like (including one I stole from you).

To me, the biggest parallel I see between this and existing work is program correctness. It is as hard IMHO to prove program correctness (as in: this program is supposed to sort records / extract every record with inconsistent ID numbers / whatever, and actually does) as it is to write the program correctly; actually, I think it's harder. So I never pursued it. Now we see a really good reason to pursue it. And even with conventional, non-AI programs, we have the problem of precisely defining what we want done.

Proving program correctness seems closer to the MIRI approach to me.

Hypothesis regarding your confusion about agency:

Describing humans using a "utility function" or through "goals" is wrong.

Humans are a bunch of habits (like CFAR TAPs) which have some correlation with working towards goals, but this is more of an imperfect rationalization than a reasonable/natural way to describe the situation.

Also yes, we have some part that thinks in goals, but it has a very limited effect on anything (like actions) compared to what we'd naturally think.

Credit to a friend

[I have no idea what I'm talking about, feel free to ignore if this doesn't resonate of course, seemed worth a comment]

I'm perpetually surprised by the amount of thought that goes into this sort of thing, coupled with the lack of attention to the philosophical literature on theories of mind and agency in the past, let's just say, 50 years. I mean, look at the entire debate around whether or not it's possible to naturalize normativity - most of the philosophical profession has given up on this, or accepts that the question was at best too hard to answer and at worst ill-conceived from the start.

These literatures are very aware of, and conversant with, the latest and greatest in cogsci... (read more)

Do you have any thoughts on chess computers, guided missiles, computer viruses, etc., and whether they make a case for worries about AGI, even if you consider them something alien to the human kind of intelligence?
Blake H.
No - but perhaps I'm not seeing how they would make the case. Is the idea that somehow their existence augurs a future in which tech gets more autonomous, to a point where we can no longer control it? I guess I'd say: why should we believe that's true? It's probably uncontroversial to believe many of our tools will get more autonomous - but why should we think that'll lead to the kind of autonomy we enjoy?

Even if you believe that the intelligence and autonomy we enjoy exist on a kind of continuum - from single-celled organisms, through chess-playing computers, to us - we'd still need reason to believe that progress along this continuum will continue at a rate necessary to close the gap between where we sit on the continuum and where our best artifacts currently sit. I don't doubt that progress will continue; but even if the continuum view were right, I think we sit way further out on the continuum than most people with the continuum view think.

Also, the continuum view itself is very, very controversial. I happen to accept the arguments which aim to show that it faces insurmountable obstacles. The alternate view which I accept is that there's a difference in kind between the intelligence and autonomy we enjoy and the kind enjoyed by non-human animals and chess-playing computers. Many people think that if we accept that, we have to reject a certain form of metaphysical naturalism (e.g. the view that all natural phenomena can be explained in terms of the basic conceptual tools of physics, maths, and logic). Some people think that this form of metaphysical naturalism is bedrock stuff; that if we don't accept it, the theists win, blah blah blah, so we must naturalize mentality and agency, it must exist on a continuum, and we just need a theory which shows us how. Other people think we can have a non-reductive naturalism which takes as primitive the normative concepts found in biology and psychology. That's the view I hold. So no, I don't think th
I don't know... If I try to think of Anglophone philosophers of mind who I respect, I think of "Australian materialists" like Armstrong and Chalmers. No doubt there are plenty of worthwhile thoughts among the British, Americans, etc too, but you seem to be promoting something I deplore, the attempt to rule out various hard problems and unwelcome possibilities, by insisting that words shouldn't be used that way. Celia Green even suggested that this 1984-like tactic could be the philosophy of a new dark age in which inquiry was stifled, not by belief in religion, but by "belief in society"; but perhaps technology has averted that future. Head-in-the-sand anthropocentrism is hardly tenable in a world where, already, someone could hook up a GPT3 chatbot to a Boston Dynamics chassis, and create an entity from deep within the uncanny valley. 
Blake H.
Totally get it. There are lots of folks practicing philosophy of mind and technology today in that Aussie tradition who I think take these questions seriously and try to cash out what we mean when we talk about agency, mentality, etc., as part of their broader projects.

I'd resist your characterization that I'm insisting words shouldn't be used a particular way, though I can understand why it might seem that way. I'm rather hoping to shed more light on the idea raised by this post that we don't actually know what many of these words even mean when they're used in certain ways (hence the author's totally correct point about the need to clarify confusions about agency while working on the alignment problem). My whole point in wading in here is just to point out to a thoughtful community that there's a really long, rich history of doing just this, and even if you prefer the answers given by Aussie materialists, it's even better to understand those positions vis-a-vis their present and past interlocutors. If you understand those who disagree with them, and can articulate those positions in terms they'd accept, you understand your preferred positions even better.

I wouldn't say I deplore it, but I am always mildly amused when cogsci, compsci, and stats people start wading into plainly philosophical waters ("sort out our fundamental confusions about agency") and talk as if they're the first ones to get there - or the only ones presently splashing around. I guess I would have thought (perhaps naively) that on a site like this, people would be at least curious to see what work has already been done on these questions so they can accelerate their inquiry.

Re: ruling out hard problems - lots of philosophy is the attempt to better understand a problem's framing such that it either reduces to a different problem or disappears altogether. I'd urge you to see this as an example of that kind of thing, rather than as ruling out certain questions from the start. And on anthropocentris
Daniel Kokotajlo
Those articles are all paywalled; got free versions? I tried Sci-Hub, no luck.
? The second is already open-access, and the third both works in SH & GS (with 2 different PDF links). Only the first link fails in SH. (But what an abstract: "I also argue that if future generally intelligent AI possess a predictive processing cognitive architecture, then they will come to share our pro-moral motivations (of valuing humanity as an end; avoiding maleficent actions; etc.), regardless of their initial motivation set." Wow.)
Daniel Kokotajlo
Huh, I tried the first and third in SH, maybe I messed up somehow. My bad. Thanks! I still am interested in the first (on the principle that maybe, just maybe, it's the solution to all our problems instead of being yet another terrible argument made by philosophers about why AIs will be ethical by default if only we do X... I think I've seen two already) and would like to have access.
Jon Garcia
I can see how that would work. The author needs to be careful, though. Predictive processing may be a necessary condition for robust AGI alignment, but it is not per se a sufficient condition.

First of all, that only works if you give the AGI strong inductive priors for detecting and predicting human needs, goals, and values. Otherwise, it will tend to predict humans as though we are just "physical" systems (we are, but I mean modeling us without taking our sentience and values into account), no more worthy of special care than rocks or streams.

Second of all, this only works if the AGI has a structural bias toward treating the needs, goals, and values that it infers from predictive processing as its own. Otherwise, it may understand how to align with us, but it won't care by default.
Why was this downvoted? Sheesh!
What do you mean by "naturalize" as a verb? What is "naturalizing normativity"? Does this amount to you thinking that humans are humans because of some influence from outside of fundamental physics, which computers and non-human animals don't share?
What's important is that it means coming up with a detailed, step-by-step explanation of high-level concepts like life, shouldness, and intelligence. Just believing that they are natural is not the required explanation. Believing they are unnatural is not the only reason to disbelieve in the possibility of a reduction. Reductionism is not just the claim that things are made out of parts. It's a claim about explanation, and humans might not be smart enough to perform certain reductions.
So basically the problem is that we haven't got the explanation yet and can't seem to find it with a philosopher's toolkit? People have figured out a lot of things (electromagnetism, quantum physics, airplanes, semiconductors, DNA, visual cortex neuroscience) by mucking with physical things while having very little idea of them beforehand, just by being smart and thinking hard.

Figuring out how human concepts ground out in physics seems to have a similar blocker: we still don't have good enough neuroscience to simulate how the brain goes from neurons to high-level thoughts (where you could observe a simulated brain-critter doing human-like things in a VR environment, to tell you're getting somewhere even when you haven't reverse-engineered the semantics of the opaque processes yet). People having that kind of model to look at and trying to make sense of it could come up with all sorts of new, unobvious, useful concepts, just like people trying to figure out quantum mechanics did.

But this doesn't sound like a fun project for professional philosophers; a research project like that would need many neuroscientists and computer scientists and not very many philosophers. So if philosophers show up, look at a project like that, and go "this is stupid and you are stupid, go read more philosophy", I'm not sure they're doing it out of a purely dispassionate pursuit of wisdom.
Philosophers are not of a single mind. Some are reductionists, some are illusionists, and so on.
Blake H.
Good - though I'd want to clarify that there are some reductionists who think that there must be a reductive explanation for all natural phenomena, even if some will remain unknowable to us (for practical or theoretical reasons). Other non-reductionists believe that the idea of giving a causal explanation of certain facts is actually confused - it's not that there is no such explanation, it's that the very idea of giving certain kinds of explanation means we don't fully understand the propositions involved. E.g. if someone were to ask why certain mathematical facts are true, hoping for a causal explanation in terms of brain-facts or historical-evolutionary facts, we might wonder whether they understood what math is about.
Blake H.
Naturalizing normativity just means explaining normative phenomena in terms of other natural phenomena whose existence we accept as part of our broader metaphysics. E.g. explaining biological function in terms of evolution by natural selection, where natural selection is explained by differential survival rates and other statistical facts. Or explaining facts about minds, beliefs, attitudes, etc., in terms of non-homuncular goings-on in the brain. The project is typically aimed at humans, but shows up as soon as you get to biology and the handful of normative concepts (life, function, health, fitness, etc.) that constitute its core subject matter. Hope that helps.
I don't think I've seen the term "normative phenomena" before. So basically normative concepts are concepts in everyday language ("life", "health") which get messy if you try to push them too hard? But what are normative phenomena, then? We don't see or touch "life" or "health"; we see something closer to the actual stuff going on in the world, and then we come up with everyday word-concepts for it that sort of work until they don't.

It's not really helping in that I still have no real intuition about what you're going on about, and your AI critique seems to be aimed at something from 30 years ago instead of contemporary work like Omohundro's Basic AI Drives paper (you describe AIs as being "without the desire to evade death, nourish itself, and protect a physical body"; the paper's point is that AGIs operating in the physical world would have exactly that) or the whole deep learning explosion with massive datasets of the last few years (you say "we under-estimate by many orders of magnitude the volume of inputs needed to shape our “models”", while right now people are in a race to feed ginormous input sets to deep learning systems and probably aren't stopping anytime soon).

Like, yeah. People can be really impressive, but unless you want to make an explicit case for the contrary, people here still think people are made of parts and that there exists some way to go from a large cloud of hydrogen to people. If you think there's some impossible gap between the human and the nonhuman worlds, then how do you think actual humans got here? Right now you seem to be giving the smug shrug of someone who, on one hand, doesn't want to ask that question themselves because it's corrosive to dignified pre-Darwin liberal-arts sensibilities, and on the other hand tries to hint to people genuinely interested in the question that it's a stupid question to ask and they should read better scholarship to convince themselves of that.
Blake H.
There are many types of explanatory claims in our language. Some are causal (how did something come to be), others are constitutive (what is it to be something), others still are normative (why is something good or right). Most mathematical explanation is constitutive, most action explanation is rational, and most material explanation is causal. It's totally possible to think there's a plain causal explanation of how humans evolved (through a combination of drift and natural selection, in which proportion we will likely never know) while still thinking that the prospects for coming up with a constitutive explanation of normativity are dim (at best) or outright confused (at worst).

A common project shape for reductive naturalists is to try to use causal explanations to form a constitutive explanation of the normative aspects of biological life. If you spend enough time studying the many historical attempts that have been made at these explanations, you begin to see a pattern emerge where a would-be reductive theorist will either smuggle in a normative concept to fill out their causal story (thereby begging the question), or fail to deliver a theory with the explanatory power to make basic normative distinctions which we intuitively recognize and which the theory should be able to account for (there are several really good tests out there for this - see the various takes on rule-following problems developed by Wittgenstein). Terms like "information", "structure", "fitness", "processing", "innateness" and the like are all subject to this sort of dilemma if you really put them under scrutiny.

Magic non-natural stuff (like souls or spirit or that kind of thing) is often the device people have reached for when forced onto this dilemma. Postulating that kind of thing is just the other side of the coin, and makes exactly the same error. So I guess I'd say, I find it totally plausible how normative phenomena could be sui generis in much the same way that m
If we believe there is a plain causal explanation, that rules out some explanations we could imagine. It shouldn't now be possible for humans to have been created by a supernatural agency (as was widely thought in Antiquity, the Middle Ages, and the Renaissance, when most of the canon of philosophy was developed), and basic human functioning probably shouldn't involve processes wildly contrary to known physics (still believed by some smart people like Roger Penrose).

The other aspect is computational complexity. If we assume the causal explanation, we also get quantifiable limits on how much evolutionary work and complexity can have gone into humans. People are generally aware that there's a lot of it, and a lot less aware that it's quantifiably finite. The size of the human genome, which we can measure, creates one hard limit on how complex a human being can be. The limited amount of sensory information a human can pick up growing to adulthood, and the limited amount of computation the human brain can do during that time, create another. Evolutionary theory also gives us a very interesting extra hint that everything you see in nature should be reachable by a very gradual ascent of slightly different forms, all of which need to be viable and competitive, all the way from the simplest chemical replicators. So that's another bound: whatever is going on with humans is probably not something that has to drop out of nowhere as a ball of intractable complexity, but can be reached by some series of small-enough-to-be-understandable improvements to a small-enough-to-be-understandable initial lifeform.

The entire sphere of complex but finite computational processes has been a blind spot for philosophy. Nobody really understood it until computers had become reasonably common. (Dennett talks about this in Darwin's Dangerous Idea when discussing Conway's Game of Life.) Actually figuring things out from opaque blobs of computation like human DNA is another problem of c
Blake H.
Yeah, I agree with a lot of this. Especially: I take it that this is how most progress in artificial intelligence, neuroscience, and cogsci has proceeded (and will continue to proceed). My caution - and whole point in wading in here - is just that we shouldn't expect progress by trying to come up with a better theory of mind or agency, even with more sophisticated explanatory tools. I think it's totally coherent, and likely even, that future artificial agents (generally intelligent or not) will be created without a general theory of mind or action.

In this scenario, you get a complete causal understanding of the mechanisms that enable agents to become minded and intentionally active, but you still don't know what that agency or intelligence consists in beyond our simple, non-reductive folk-psychological explanations. A lot of folks in this scenario would be inclined to say, "who cares, we got the gears-level understanding", and I guess the only people who would care would be those who wanted to use the reductive causal story to tell us what it means to be minded.

The philosophers I admire (John McDowell is the best example) appreciate the difference between causal and constitutive explanations when it comes to facts about minds and agents, and urge that progress in the sciences is hindered by running these together. They see no obstacle to technical progress in neuroscientific understanding or artificial intelligence; they just see themselves as sorting out what these disciplines are and are not about. They don't think they're in the business of giving constitutive explanations of what minds and agents are; rather, they're in the business of discovering what enables minds and agents to do their minded and agential work. I think this distinction is apparent even with basic biological concepts like life. Biology can give us a complete account of the gears that enable life to work as it does without shedding any light on what makes it the case that something is alive, functioning