This is a high-level overview of the reasoning behind my research priorities, written as a Q&A.
What’s your plan for AI alignment?
Step 1: sort out our fundamental confusions about agency
Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)
Step 3: …
Step 4: profit!
… and do all that before AGI kills us all.
That sounds… awfully optimistic. Do you actually think that’s viable?
Better than a 50/50 chance of working in time.
Do you just have really long timelines?
No. My median is maybe 10-15 years, though that’s more a gut estimate based on how surprised I was over the past decade rather than a carefully-considered analysis. (I wouldn’t be shocked by another AI winter, especially on an inside view, but on an outside view the models generating that prediction have lost an awful lot of Bayes Points over the past few years.)
Mostly timelines just aren’t that relevant; they’d have to get down to around 18-24 months before I think it’s time to shift strategy a lot.
… Wat. Not relevant until we’re down to two years?!?
To be clear, I don’t expect to solve the whole problem in the next two years. Rather, I expect that even the incremental gains from partial progress on fundamental understanding will be worth far more than marginal time/effort on anything else, at least given our current state.
At this point, I think we’re mostly just fundamentally confused about agency and alignment. I expect approximately-all of the gains-to-be-had come from becoming less confused. So the optimal strategy is basically to spend as much time as possible sorting out as much of that general confusion as possible, and if the timer starts to run out, then slap something together based on the best understanding we have.
18-24 months is about how long I expect it to take to slap something together based on the best understanding we have. (Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)
But iterative engineering is important!
In order for iterative engineering to be useful, we first need to have a strong enough understanding of what we even want to achieve in order to recognize when an iteration has brought us a step closer to the goal. No amount of A/B testing changes to our website will make our company profitable if we’re measuring the wrong metrics. I claim that, for alignment, we do not yet have a strong enough understanding for iteration to produce meaningful progress.
When I say “we’re just fundamentally confused about agency and alignment”, that’s the sort of thing I’m talking about.
To be clear: we can absolutely come up with proxy measures of alignment. The problem is that I don’t expect iteration under those proxy measures to get us meaningfully closer to aligned AGI. No reasonable amount of iterating on gliders’ flight-range will get one to the moon.
But engineering is important for advancing understanding too!
I do still expect some amount of engineering to be central for making progress on fundamental confusion. Engineering is one of the major drivers of science; failed attempts to build amplifiers drove our first decent understanding of semiconductors, for instance. But this is a very different path-to-impact than directly iterating on “alignment”, and it makes sense to optimize our efforts differently if the path-to-impact is through fundamental understanding. Just take some confusing concept which is fundamental to agency and alignment (like abstraction, or optimization, or knowledge, or …) and try to engineer anything which can robustly do something with that concept. For instance, a lot of my own work is driven by the vision of a “thermometer of abstraction”, a device capable of robustly and generalizably measuring abstractions and presenting them in a standard legible format. It’s not about directly iterating on some alignment scheme, it’s about an engineering goal which drives and grounds the theorizing and can be independently useful for something of value.
Also, the theory-practice gap is a thing, and I generally expect the majority of “understanding” work to go into crossing that gap. I consider such work a fundamental part of sorting out confusions; if the theory doesn’t work in practice, then we’re still confused. But I also expect that the theory-practice gap is only very hard to cross the first few times; once a few applications work, it gets much easier. Once the first field-effect transistor works, it’s a lot easier to come up with more neat solid-state devices, without needing to further update the theory much. That’s why it makes sense to consider the theory-practice gap a part of fundamental understanding in its own right: once we understand it well enough for a few applications, we usually understand it well enough to implement many more with much lower marginal effort.
An analogy: to go from medieval castles to skyscrapers, we don’t just iterate on stone towers; we leverage fundamental scientific advances in both materials and structural engineering. My strategy for building the tallest possible metaphorical skyscraper is to put all my effort into fundamental materials and structural science. That includes testing out structures as-needed to check that the theory actually works, but the goal there is understanding, not just making tall test-towers; tall towers might provide useful data, but they’re probably not the most useful investment until we’re near the end-goal. Most of the iteration is on e.g. metallurgy, not on tower-height directly. Most of the experimentation is on e.g. column or beam loading under controlled conditions, again not on tower-height directly. If the deadline is suddenly 18-24 months, then it’s time to slap together a building with whatever understanding is available, but hopefully we figure things out fast enough that the deadline isn’t that limiting of a constraint.
What do you mean by “fundamentally confused”?
My current best explanation of “fundamental confusion” is that we don’t have the right frames. When thinking about agency or alignment, we do not know:
- What are the most important questions to ask?
- What approximations work?
- What do we need to pay attention to, and what can we safely ignore?
- How can we break the problem/system up into subproblems/subsystems?
For all of these, we can certainly make up some answers. The problem is that we don’t have answers to these questions which seem likely to generalize well. Indeed, for most current answers to these questions, I think there are strong arguments that they will not generalize well. Maybe we have an approximation which works well for a particular class of neural networks, but we wouldn’t expect it to generalize to other kinds of agenty systems (e.g. a bacterium), and it’s debatable whether it will even apply to future ML architectures. Maybe we know of some possible failure modes for alignment, but we don’t know which of them we need to pay attention to vs which will mostly sort themselves out, especially in future regimes/architectures which we currently can’t test. (Even more important: there’s only so much we can pay attention to at all, and we don’t know what details are safe to ignore.) Maybe we have a factorization of alignment which helps highlight some particular problems, but the factorization is known to be leaky; there are other problems which it obscures.
By contrast, consider putting new satellites into orbit. At this point, we generally know what the key subproblems are, what approximations we can make, what to pay attention to, what questions to ask. Most importantly, we are fairly confident that our framing for satellite delivery will generalize to new missions and applications, at least in the near-to-medium-term future. When someone needs to put a new satellite in orbit, it’s not like the whole field needs to worry about their frames failing to generalize.
(Note: there are probably aspects of “fundamental confusion” which this explanation doesn’t capture, but I don’t have a better explanation right now.)
What are we fundamentally confused about?
We’ve already talked about one example: I think we currently do not understand alignment well enough for iterative engineering to get us meaningfully closer to solving the real problem, in the same way that iterating on glider range will not get one meaningfully closer to going to the moon. When iterating, we don’t currently know which questions to ask, we don’t know which things to pay attention to, we don’t know which subproblems are bottlenecks.
Here’s a bunch of other foundational problems/questions where I think we currently don’t know the right framing to answer them in a generalizable way:
- Is an e-coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”?
- What even are "human values"? What’s the type signature of human values?
- Given two agents (with potentially completely different world models), how can I tell whether one is "trying to help" the other? What does that even mean?
- Given a trained neural network, does it contain any subagents? What are their world-models, and what do they want?
- Given an atomically-precise scan of a whole human brain, body, and local environment, and unlimited compute, calculate the human’s goals/wants/values, in a manner legible to an automated optimizer.
- Given some physical system, identify any agents in it, and what they’re optimizing for.
- Back out the learned objective of a trained neural net, and compare it to the training objective.
What kinds of “incremental progress” do you have in mind here?
As an example, I’ve spent the last couple years better understanding abstraction (and I’m currently working to push that across the theory-practice gap). It’s a necessary component for the sorts of questions I want to answer about agency in general (like those above), but in the nearer term I also expect it to provide very strong ML interpretability tools. (This is a technical thing, but if you want to see the rough idea, take a look at the Telephone Theorem post and imagine that the causal models are computational circuits for neural nets. There are still some nontrivial steps after that to adapt the theorem to neural nets, but it should convey the general idea, and it's a very simple theorem.) If I found out today that AGI was two years away, I’d probably spend a few more months making the algorithms for abstraction-extraction as efficient as I could get them, then focus mainly on applying it to interpretability.
(What I actually expect/hope is that I’ll have efficient algorithms demo-ready in the first half of next year, and then some engineers will come along and apply them to interpretability while I work on other things.)
Another example: the next major thing to sort out after abstraction will be when and why large optimized systems (e.g. neural nets or biological organisms) are so modular, and how the trained/evolved modularity corresponds to modular structures in the environment. I expect that will yield additional actionable insights into ML interpretability, and especially into what environmental/training features lead to more transparent ML models.
Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?
Mostly I think MIRI has been asking not-quite-the-right-questions, in not-quite-the-right-ways.
Not-quite-the-right-questions: when I look at MIRI’s past work on agent foundations, it’s clear that the motivating questions were about how to build AGI which satisfies various desiderata (e.g. stable values under self-modification, corrigibility, etc). Trying to understand agency-in-general was mostly secondary, and was not the primary goal guiding choice of research directions. One clear example of this is MIRI’s work on proof-based decision theories: absolutely nobody would choose this as the most-promising research direction for understanding the decision theory used by, say, an e-coli. But plenty of researchers over the years have thought about designing AGI using proof-based internals.
I’m not directly thinking about how to design an AGI with useful properties. I’m trying to understand agenty systems in general - be it humans, ML systems, e-coli, cats, organizations, markets, what have you. My impression is that MIRI’s agent foundations team has started to think more along these lines over time (especially since Embedded Agency came out), but I think they’re still carrying a lot of baggage.
… which brings us to MIRI tackling questions in not-quite-the-right-ways. The work on Tiling Agents is a central example here: the problem is to come up with models for agents which copy themselves, so copies of the agents “tile” across the environment. When I look at that problem through an “understand agency in general” lens, my immediate thought is “ah, this is a baseline model for evolution”. Once we have a good model for agents which “reproduce” (i.e. tile), we can talk about agents which approximately-reproduce with small perturbations (i.e. mutations) and the resulting evolutionary process. Then we can go look at how evolution actually behaves to empirically check our models.
When MIRI looks at the Tiling Agents problem, on the other hand, they set it up in terms of proof systems proving things about “successor” proof systems. Absolutely nobody would choose this as the most natural setup to talk about evolution. It’s a setup which is narrowly chosen for a particular kind of “agent” (i.e. AI with some provable guarantees) and a particular use-case (i.e. maintaining the guarantees when the AI self-modifies).
Main point: it does not look like MIRI has primarily been trying to sort out fundamental confusions about agency-in-general, at least not for very long; that’s not what they were optimizing for. Their work was much more narrow than that. And this is one of those cases where I expect the more-general theory to be both easier to find (because we can use lots of data from existing agenty systems in biology, economics and ML) and more useful (because it will more likely generalize to many use-cases and many kinds of agenty systems).
Side note: contrary to popular perception, MIRI is an extremely heterogeneous org, and the criticisms above apply to different people at different times to very different degrees. That said, I think it’s a reasonable representation of the median past work done at MIRI. Also, MIRI is still the best org at this sort of thing, which is why I’m criticizing them in particular.
What’s the roadmap?
Abstraction is the main foundational piece (more on that below). After that, the next big piece will be selection theorems, and I expect to ride that train most of the way to the destination.
Regarding selection theorems: I think most of the gap between aspects of agency which we understand in theory, and aspects of agenty systems which seem to occur consistently in practice, comes from broad and robust optima. Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are “broad”: optima whose basins fill a lot of parameter/genome space. And they find optima which are robust: small changes in the distribution of the environment don’t break them. There are informal arguments that this leads to a lot of key properties:
- Modularity of the trained/evolved system (which we do indeed see in practice)
- Good generalization properties
- Information compression
… but we don’t have good formalizations of those arguments, and we’ll need the formalizations in order to properly leverage these properties for engineering.
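To make the "broad optima" half of that claim concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration (the 1-D piecewise-quadratic loss, the basin boundaries, the learning rate, and the names `grad`, `descend` and `broad_fraction`); it is not a formalization and says nothing about robustness. It only shows that when two minima are equally deep, gradient descent from random initializations lands in each one roughly in proportion to the size of its basin.

```python
import random

def grad(x):
    """Gradient of a toy 1-D loss with two equally deep minima (hypothetical)."""
    if x < 2.0:
        # Narrow basin: loss (x - 1)^2 - 1, minimum at x = 1, covers [0, 2).
        return 2.0 * (x - 1.0)
    # Broad basin: loss ((x - 6) / 4)^2 - 1, minimum at x = 6, covers [2, 10].
    return (x - 6.0) / 8.0

def descend(x, lr=0.1, steps=500):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def broad_fraction(n_runs=2000, seed=0):
    """Fraction of uniformly random starts on [0, 10] that end near x = 6."""
    rng = random.Random(seed)
    finals = [descend(rng.uniform(0.0, 10.0)) for _ in range(n_runs)]
    return sum(abs(x - 6.0) < 0.5 for x in finals) / n_runs

if __name__ == "__main__":
    print(f"fraction of runs ending in the broad basin: {broad_fraction():.2f}")
```

The broad basin covers 80% of the interval, so about 80% of the runs should end there even though the narrow minimum is exactly as good. Real parameter spaces are high-dimensional, where basin volume differences are presumably far more extreme than in this one-dimensional picture.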
Besides that, there’s also some cruft to clean up in existing theorems around agency. For instance, coherence theorems (i.e. the justifications for Bayesian expected utility maximization) have some important shortcomings, and are incomplete in important ways. And of course there’s also work to be done on the theoretical support structure for all this - for instance, sorting out good models of what optimization even means.
Why do we need formalizations for engineering?
It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.
Let’s get a bit more concrete with the modularity example. We could try to build some non-gears-level (i.e. black-box) model of modularity in neural networks by training some different architectures in different regimes on different tasks and with different parameters, empirically computing some proxy measure of “modularity” for each trained network, and then fitting a curve to it. This will probably work great right up until somebody tries something well outside of the distribution on which this black-box model was fit. (Those crazy engineers are constantly pushing the damn boundaries; that’s largely why they’re so useful for driving fundamental understanding efforts.)
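As a rough sketch of what one such proxy measure might look like, the Python below computes the standard Newman modularity score Q of a partition of a weighted, undirected graph. Applying it to a trained network would require further choices which are purely assumptions here: for instance, treating neurons as nodes, absolute weights as edge weights, and some clustering algorithm (not shown) to propose the partition. The helper `two_triangles` is a hypothetical toy graph, not data from any real network.

```python
def newman_modularity(adj, labels):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    adj:    symmetric matrix (list of lists) of non-negative edge weights
    labels: labels[i] is the module assigned to node i
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # total edge weight, with every edge counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Actual weight minus the weight expected if edges were
                # placed at random with the same node degrees.
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

def two_triangles():
    """Toy graph: two disconnected triangles on nodes 0-2 and 3-5."""
    adj = [[0.0] * 6 for _ in range(6)]
    for group in ((0, 1, 2), (3, 4, 5)):
        for i in group:
            for j in group:
                if i != j:
                    adj[i][j] = 1.0
    return adj

if __name__ == "__main__":
    adj = two_triangles()
    print(newman_modularity(adj, [0, 0, 0, 1, 1, 1]))  # partition matching the triangles
    print(newman_modularity(adj, [0, 1, 2, 0, 1, 2]))  # partition cutting across them
```

The partition which matches the two triangles scores 0.5, while the one which cuts across them scores about -0.33. A black-box study would compute a number like this for many trained networks and fit a curve to it; the number by itself says nothing about why the structure is there.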
On the other hand, if we understand why modularity occurs in trained/evolved systems, then we can follow the gears of our reasoning even on new kinds of systems. More importantly, we can design new systems to leverage those gears without having to guess and check.
Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.
Why so much focus on abstraction?
Abstraction is a common bottleneck to a whole bunch of problems in agency and alignment. Questions like:
- If I have some system, what’s the right way to carve out a subsystem (which might be an “agent”, or a “world model”, or an “optimizer”, etc)? This should be robust/general enough to let us confidently say things like e.g. “there are no agents embedded in this trained neural net”.
- What kinds-of-things show up in world models? For instance, is an AI likely to have internal notions of “tree” or “rock” or “car” which map to the corresponding human notions, and how closely?
- How can we empirically measure high-level abstract things (like trees or agents) in the real world, in robustly generalizable ways?
- To the extent that humans care about high-level abstract things like trees or cars, rather than quantum fields, how can we formalize that?
- How can we translate the internal concepts used by trained ML systems into human-legible concepts, robustly enough that we won’t miss anything important (or at least can tell if we do)?
… and so forth. The important point isn’t any one of these questions; the important point is that understanding abstraction is a blocker for a whole bunch of different things. That’s what makes it an ideal target to focus on. Once it’s worked out, I expect to be unblocked not just on the above questions, but also on other important questions I haven’t even thought of yet - if it’s a blocker for many things already, it’s probably also a blocker for other things which I haven’t noticed.
If I had to pick one central reason why abstraction matters so much, it’s that we don’t currently have a robust, generalizable and legible way to measure high-level abstractions. Once we can do that, it will open up a lot of tricky conceptual questions to empirical investigation, in the same way that robust, generalizable and legible measurement tools usually open up scientific investigation of new conceptual areas.
But, like, 10-15 years?!?
A crucial load-bearing part of my model here is that agency/alignment work will undergo a phase transition in the next ~5 years. We’ll go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset. Or at the very least I expect to have a workable paradigm; whether anyone else jumps on board is a more open question.
There’s more than one possible path here, more than one possible future paradigm. My estimate of “~5 years” comes from eyeballing the current rate of progress, plus a gut feel for how close the frames are to where they need to be for progress to take off.
As an example of one path which I currently consider reasonably likely: abstraction provides the key tool for the phase transition. Once we can take a simulated environment or a trained model or the like, and efficiently extract all the natural abstractions from it, that changes everything. It’ll be like introducing the thermometer to the study of thermodynamics. We’ll be able to directly, empirically answer questions like “does this model know what a tree is?” or “does this model have a notion of human values?” or “is ‘human’ a natural abstraction?” or “are the agenty things in this simulation natural abstractions?” or …. (These won’t be yes/no answers, but they’ll be quantifiable in a standardized and robustly-generalizable way.) This isn’t a possibility I expect to be legibly plausible to other people right now, but it’s one I’m working towards.
Another path: once a few big selection theorems are sorted out (like modularity of evolved systems, for instance) and empirically verified, we’ll have a new class of tools for empirical study of agenty systems. Like abstraction measurement, this has the potential to open up a whole class of tricky conceptual questions to empirical investigation. Things like “what is this bacterium’s world model?” or “are there any subagents in this trained neural network?”. Again, I don’t necessarily expect this possibility to be legibly plausible to other people right now.
To be clear: not all of my “better than 50/50 chance of working in time” comes from just these two paths. I’ve sketched a fair amount of burdensome detail here, and there are a lot of variations which lead to similar outcomes with different details, as well as entirely different paths. But the general theme is that I don’t think it will take too much longer to get to a point where we can start empirically investigating key questions in robustly-generalizable ways (rather than the ad-hoc methods used for empirical work today), and get proper feedback loops going for improving understanding.
Why ambitious value learning?
It’s the best-case outcome. I mean, c’mon, it’s got “ambitious” right there in the name.
… but why not aim for some easier strategy?
The main possibly-easier strategy for which I don’t know of any probably-fatal failure mode is to emulate/simulate humans working on the alignment problem for a long time, i.e. a Simulated Long Reflection. The main selling point of this strategy is that, assuming the emulation/simulation is accurate, it probably performs at least as well as we would actually do if we tackled the problem directly.
This is really a whole class of strategies, with many variations, most of which involve training ML systems to mimic humans. (Yes, that implies we’re already at the point where it can probably FOOM.) In general, the further the variations get from just directly simulating humans working on alignment basically the way we do now (but for longer), the more possibly-fatal failure modes show up. HCH is a central example here: for some reason a structure whose most obvious name is The Infinite Bureaucracy was originally suggested as an approximation of a Long Reflection. Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons (except even worse, because it’s infinite).
… but the failure of variations does not necessarily mean that the basic idea is doomed. The basic idea seems basically-sound to me; the problem is implementing it in such a way that the output accurately mimics a real long reflection, while also making it happen before unfriendly AGI kills us all.
Personally, I’m still not working on that strategy, for a few main reasons:
- I expect my current strategy to be more competitive. One big advantage of understanding agency in general is that we can apply that understanding to whatever ML/AI progress comes along, even if it ends up looking very different from e.g. GPT-3.
- The Simulated Long Reflection strategy gets more likely to work when we have people for it to mimic who are already far down the road to solving alignment. The further, the better.
- On a gut level, I just don’t expect ML to emulate humans accurately enough for a Simulated Long Reflection to work until we’ve already passed doomsday. (This is probably the cruxiest issue.)
I am generally happy that other people are working on strategies in the Simulated Long Reflection family, and hope that such work continues.
I want to disagree about MIRI.
Mostly, I think that MIRI (or at least a significant subset of MIRI) has always been primarily directed at agenty systems in general.
I want to separate agent foundations at MIRI into three eras: the Eliezer Era (2001-2013), the Benya Era (2014-2016), and the Scott Era (2017-).
The transitions between eras had an almost complete overhaul of the people involved. In spite of this, I believe that they have roughly all been directed at the same thing, and that John is directed at the same thing.
The proposed mechanism behind the similarity is not transfer; rather, it is that agency in general is a convergent/natural topic.
I think throughout time, there has always been a bias in the pipeline from ideas to papers towards being more about AI. I think this bias has gotten smaller over time, as the agent foundations research program both started having stable funding, and started carrying less and less of the weight of all of AI alignment on its back. (Before going through editing with Rob, I believe Embedded Agency had no mention of AI at all.)
I believe that John thinks that the Embedded Agency document is especially close to his agenda, so I will sta...
I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI's agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI's history, elucidating that general theory has not been the primary optimization objective.
Let's go through some examples.
The Sequences: we can definitely see Eliezer's understanding of the general theory of agency in many places, especially when talking about Bayes and utility. (Engines of Cognition is a central example.) But most of the sequences talk about things like failure modes of human cognition, how to actually change your mind, social failure modes of human cognition, etc. It sure looks like the primary optimization objective is about better human thinking, plus some general philosophical foundations, not the elucidation of the general theory of agency.
Tiling agents and proof-based decision theories: I'm on board with the use of proof-based setups to make minimal assumptions about "the substrate that the agency is made of". That's an entirely reasonable choice, and it does look like t...
Hmm, yeah, we might disagree about how much reflection (self-reference) is a central part of agency in general.
It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of "reflective agents."
Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.
Does this seem right to you?
To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI's lack of progress at least as much as you would if MIRI was doing the same thing as you, but for half as long.
Which isn't *that* large an update. The average number of agent foundations researchers (that are public-facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.
Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality-adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is non-trivially closer to John today than Agent Foundations 5 years ago is to John today.)
It seems accurate to me to say that Agent Foundations in 2014 was more focused on reflection, which shifted towards embeddedness, and then shifted towards abstraction, and that these things all flow together in my head, and so Scott thinking about abstraction will have more reflection mixed in than John thinking about abstraction. (Indeed, I think progress on abstraction would have huge consequences on how we think about reflection.)
In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred from the fact that I am arguing that it is similar to my own.) I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't expect disagreements about what to do next to have a strong impact on how to do the first step.
I think you've really hit the nail on the head on what's wrong (and right) with the MIRI approach. The Cartesian Frames stuff seems to be the best stuff they've done in this direction.
I've also felt that our lack of understanding of abstraction is one of the key bottlenecks. How concerned are you about insights on this question also applying to unaligned AGI development?
Enough that I have considered keeping it secret, but I think keeping it public is a strong net positive relative to our current state (i.e. giant inscrutable vectors of floating-points). If there were, say, another AI winter, then I could easily imagine changing my mind about that.
I feel like your answer to "Why do we need formalizations for engineering?" just restates the claim rather than arguing for it. It sounds like you are saying "...we need formalizations because we need gears-level understanding, and formalizations are the way you get gears-level understanding in this domain." But why are formalizations the way to gears-level understanding in this domain? There are plenty of domains where one can have gears-level understanding without formalization.
Maybe I'm just not interpreting "same rough shape" loosely enough. If pretty much any reasonable argument counts as the same rough shape as a proof, then I take back what I said.
"Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons"
I guess this isn't immediately obvious to me. Bureaucracies fail because at each level the bosses tell the subordinates what to do and they just have to do it. In HCH, sure, each subordinate performs a fixed mental task, but the boss gets to consider the result and make up its own mind, taking into account the reports from the other subordinates. All this extra processing makes me feel as though it isn't exactly the same thing.
(I'm going to respond here to two different comments about HCH and why bureaucracies fail.)
I think a major reason why people are optimistic about HCH is that they're confused about why bureaucracies fail.
Responding to Chris: if you go look at real bureaucracies, it is not really the case that "at each level the bosses tell the subordinates what to do and they just have to do it". At every bureaucracy I've worked in/around, lower-level decision makers had many de facto degrees of freedom. You can think of this as a generalization of one of the central problems of jurisprudence: in practice, human "bosses" (or legislatures, in the jurisprudence case) are not able to give instructions which unambiguously specify what to do in all the crazy situations which come up in practice. Nor do people at the top have anywhere near the bandwidth needed to decide every ambiguous case themselves; there is far too much ambiguity in the world. So, in practice, lower-level people (i.e. judges at various levels) necessarily make many many judgement calls in the course of their work.
Also, in general, tons of information flows back up the hierarchy for higher-level people to make decisions. There are alr...
Curated. Not that many people pursue agendas to solve the whole alignment problem and of those even fewer write up their plan clearly. I really appreciate this kind of document and would love to see more like this. Shoutout to the back and forth between John and Scott Garrabrant about John's characterization of MIRI and its relation to John's work.
I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails in the worst case) is to let humans be epistemically competitive with the systems you’re trying to train.
Excellent post! This seems like a highly promising and under-explored line of attack. I've had some vaguely similar thoughts over the years, but you've done a far better job articulating and developing a coherent programme. Bravo!
I think my biggest intuitive disagreement might be with whether it is likely to be possible to create some sort of efficient 'abstraction thermometer' or 'agency thermometer'. Searching for possible ways of finding agents or abstractions in a system seems like a prototypical NP-hard search problem. Now, in practice it's often possible to solve such problems efficiently, but the setting with agents seems especially problematic in that keeping yourself obfuscated can be instrumentally useful, so I suspect the instances we're confronted with in the real world may be adversarially selected to be inscrutable to fast search methods in general.
There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.
Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.
That's the part I find puzzling in terms of lack of time devoted to it: how can one talk about agency without figuring out the basics like that? Though I personally argued that it might not even be possible to do in this post, which conjectured that vapor bubbles "maximizing their volume" in a pot of boiling water are not qualitatively different from bacteria going against a sugar gradient in search of food.
Great post. To the extent that progress can be made on this, it seems extremely important.
A question on your HCH scepticism:
I'd be interested if you could elaborate on that. To me it seems HCH shares some elements of bureaucracy, but that there are important differences.
...
- They share the property of not reliably optimising for the task they're given (HCH is best considered a sovereign, not an oracle: it's an oracle iff it wants to
I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia). What we want is an AGI that can robustly identify humans (and I would argue, any agentic system), determine their values in an iteratively improving way, and treat these learned values as its own. That is, we should be looking for models where goal alignment and a desire to cooperate with humanity are situated within a broad basin of attraction (like how corrigibility is supposed to work), where any misalignment that the AGI notice...
This post is one of the LW posts a younger version of myself would have been most excited to read. Building on what I got from the Embedded Agency sequence, this post lays out a broad-strokes research plan for getting the alignment problem right. It points to areas of confusion, it lists questions we should be able to answer if we got this right, it explains the reasoning behind some of the specific tactics the author is pursuing, and it answers multiple common questions and objections. It leaves me with a feeling of "Yeah, I could pursue that too if I wanted, and I expect I could make some progress" which is a shockingly high bar for a purported plan to solve the alignment problem. I give this post +9.
If you are looking for a very general yet simple model of agency or at least decision making you might want to have a look at The geometry of decision-making in individuals and collectives....
"Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]"
I understand why you think that broad is true. But I'm not sure I get robust. In fact, robust seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how "robust" an optimum is? Part of me thinks that the idea comes from how, if a system optimized for a non-robust optimum, it wouldn't internally be doing anything different, but we...
“It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.”
In this case the most straightforward approach would be to simply derive E. coli behaviour from basic quantum chemistry, as that is the closest field where fully deterministic simulations are possible and verifiable.
The gap between simulating hydrogen...
I think having that post on the AF would be very good. ;)
In general a great piece. One thing that I found quite relatable is the point about the preparadigmatic stage of AI safety going into later stages soon. It feels like this is already happening to some degree where there are more and more projects readily available, more prosaic alignment and interpretability projects at large scale, more work done in multiple directions and bigger organizations having explicit safety staff and better funding in general.
With these facts, it seems like there's bound to be a relatively big phase shift in research and action within the field that I'm quite excited about.
Regarding modularity - you might be interested in my Motivations, Natural Selection, and Curriculum Engineering > Modularity of Capability Accumulation from last week - it has a few speculations and (probably more usefully) a couple of references you might like (including one I stole from you).
To me the biggest parallel I see in this to existing work is to that of program correctness. It is as hard IMHO to prove program correctness (as in: this program is supposed to sort records/extract every record with inconsistent ID numbers/whatever, and actually does) as it is to write the program correctly; actually, I think it's harder. So I never pursued it. Now we see a really good reason to pursue it. And even w/ conventional, non-AI programs, we have the problem of precisely defining what we want done.
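To make the "precisely defining what we want done" point concrete, here is a minimal Python sketch using the sorting case. All names (`weak_spec`, `full_spec`, `bogus_sort`) are hypothetical and chosen only for illustration. It checks specifications on a single input rather than proving anything, so it shows only the specification half of the problem: a spec that merely asks for ordered output is satisfied by a program that throws every record away.

```python
from collections import Counter


def is_ordered(xs):
    # True when every element is <= its successor (vacuously true for []).
    return all(a <= b for a, b in zip(xs, xs[1:]))


def weak_spec(inp, out):
    # Incomplete specification: only demands that the output is in order.
    return is_ordered(out)


def full_spec(inp, out):
    # Fuller specification: in order AND a permutation of the input.
    return is_ordered(out) and Counter(inp) == Counter(out)


def bogus_sort(xs):
    # Deliberately wrong "sort" (hypothetical): discards all records.
    return []


records = [3, 1, 2, 1]
weak_ok = weak_spec(records, bogus_sort(records))   # wrong program passes
full_ok = full_spec(records, bogus_sort(records))   # wrong program rejected
real_ok = full_spec(records, sorted(records))       # correct program passes
```

Here `bogus_sort` passes `weak_spec` but fails `full_spec`, while Python's built-in `sorted` passes the fuller spec on this input. Getting from "passes on this input" to "correct on all inputs" is the proof burden the comment describes.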
Hypothesis regarding your confusion about agency:
Describing humans using a "utility function" or through "goals" is wrong.
Humans are a bunch of habits (like CFAR TAPs) which have some correlation with working towards goals, but this is more of an imperfect rationalization than a reasonable/natural way to describe the situation.
Also yes, we have some part that thinks in goals, but it has a very limited effect on anything (like actions) compared to what we'd naturally think.
Credit to a friend
[I have no idea what I'm talking about, feel free to ignore if this doesn't resonate of course, seemed worth a comment]
I'm perpetually surprised by the amount of thought that goes into this sort of thing coupled with the lack of attention to the philosophical literature on theories of mind and agency in the past, let's just say 50 years. I mean look at the entire debate around whether or not it's possible to naturalize normativity - most of the philosophical profession has given up on this or accepts the question was at best too hard to answer, at worst, ill-conceived from the start.
These literatures are very aware of, and conversant with, the latest and greatest in cogsci...