
So, how’s The Plan going?

Pretty well!

In last year’s writeup of The Plan, I gave “better than a 50/50 chance” that it would work before AGI kills us all (and my median AI timelines were around 10-15 years). That was an outside view, accounting for planning fallacy and the inevitable negative surprises. My inside view was faster - just based on extrapolating my gut feel of the rate of progress, I privately estimated that The Plan would take around 8 years. (Of those 8, I expected about 3 would be needed to nail down the core conceptual pieces of agent foundations, and the other 5 to cross the theory-practice gap. Of course those would be intermingled, though with the theory part probably somewhat more front-loaded.)

My current gut feel is that progress over the past year has been basically in line with the inside-view 8-year estimate (now down to 7, since a year has passed), and maybe even a little bit faster than that.

So, relative to my outside-view expectation that things always go worse than my gut expects, things are actually going somewhat better than expected! I’m overall somewhat more optimistic now, although the delta is pretty small. It’s only been a year; there’s still lots of time for negative surprises to appear.

Any high-level changes to The Plan?

There have been two main high-level changes over the past year.

First: The Plan predicted that, sometime over the next 5 (now 4) years, the field of alignment would “go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset”. Over the past year, I tentatively think the general shape of that paradigm has become visible, as researchers converge from different directions towards a common set of subproblems.

Second: I’ve updated away from thinking about ambitious value learning as the primary alignment target. Ambitious value learning remains the main long-term target, but I’ve been convinced that e.g. corrigibility is worth paying attention to as a target for early superhuman AGI. Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.

Convergence towards a paradigm sounds exciting! So what does it look like?

Exciting indeed! Gradual convergence toward a technical alignment paradigm has probably been the most important update from the past year.

On the theoretical side, Paul Christiano, Scott Garrabrant, and I had all basically converged to working on roughly the same problem (abstraction, ontology identification, whatever you want to call it) by early 2022. That kind of convergence is a standard hallmark of a proto-paradigm.

Meanwhile, within the past year-and-a-half or so, interpretability work has really taken off; Chris Olah’s lab is no longer head-and-shoulders stronger than everyone else. And it looks to me like the interpretability crowd is also quickly converging on the same core problem of abstraction/ontology-identification/whatever-you-want-to-call-it, but from the empirical side rather than the theoretical side.

That convergence isn’t complete yet - I think a lot of the interpretability crowd hasn’t yet fully internalized the framing of “interpretability is primarily about mapping net-internal structures to corresponding high-level interpretable structures in the environment”. In particular I think a lot of interpretability researchers have not yet internalized that mathematically understanding what kinds of high-level interpretable structures appear in the environment is a core part of the problem of interpretability. You have to interpret the stuff-in-the-net as something, and it’s approximately-useless if the thing-you-interpret-stuff-in-the-net-as is e.g. a natural-language string without any legible mathematical structure attached, or an ad-hoc mathematical structure which doesn’t particularly cut reality at the joints. But interpretability researchers have a very strong feedback loop in place, so I expect they’ll iterate toward absorbing that frame relatively quickly. (Though of course there will inevitably be debate about the frame along the way; I wouldn’t be surprised if it’s a hot topic over the next 1-2 years. And also in the comment section of this post.)

Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.
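To make the "experiment side" of that picture concrete, here is a toy sketch (the activation matrix and environment variable below are synthetic stand-ins, not any particular lab's setup): a linear probe maps a layer's activations onto a candidate high-level environment variable, while the theory side of the paradigm is about deciding which variables and data structures are even worth probing for.

```python
# Toy sketch of the "experiment side": fit a linear probe from a layer's cached
# activations to a candidate high-level environment variable. Both arrays below
# are synthetic stand-ins for whatever the net and environment actually provide.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 64))            # hidden states, one row per input
env_variable = (activations[:, 3] > 0).astype(int)   # toy "structure in the environment"

X_tr, X_te, y_tr, y_te = train_test_split(activations, env_variable, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# High accuracy says the variable is linearly decodable from this layer; it does
# not by itself say the net "uses" it, nor that this variable is the right data
# structure to map onto - that latter question is the theory side of the paradigm.
```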

As that shift occurs, I expect we’ll also see more discussion of end-to-end alignment strategies based on directly reading and writing the internal language of neural nets. (Retargeting The Search is one example, though it makes some relatively strong assumptions which could probably be relaxed quite a bit.) Since such strategies very directly handle/sidestep the issues of inner alignment, and mostly do not rely on a reward signal as the main mechanism to incentivize intended behavior/internal structure, I expect we’ll see a shift of focus away from convoluted training schemes in alignment proposals. On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.

Assuming this paradigm formation extrapolation is roughly correct, it’s great news! This sort of paradigm formation is exactly why The Plan was so optimistic about being able to solve alignment in the next 10-15 (well, now 9-14) years. And, if anything, it currently looks like the paradigm is coming together somewhat faster than expected.

Why the update about corrigibility?

Let’s start with why I mostly ignored corrigibility before. Mainly, I wasn’t convinced that “corrigibility” was even a coherent concept. Lists of desiderata for corrigibility sounded more like a grab-bag of tricks than like a set of criteria all coherently pointing at the same underlying concept. And MIRI’s attempts to formalize corrigibility had found that it was incompatible with expected utility maximization. That sounds to me like corrigibility not really being “a thing”.

Conversely, I expect that some of the major benefits which people want from corrigibility would naturally come from value learning. Insofar as humans want their AGI to empower humans to solve their own problems, or try to help humans do what the humans think is best even if it seems foolish to the AGI, or… , a value-aligned AI will do those things. In other words: value learning will produce some amount of corrigibility, because humans want their AGI to be corrigible. Therefore presumably there's a basin of attraction in which we get values "right enough" along the corrigibility-relevant axes.

The most interesting update for me was when Eliezer reframed the values-include-some-corrigibility argument from the opposite direction (in an in-person discussion): insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values. In other words, the key mathematical challenges of corrigibility are themselves robust subproblems of alignment, which need to be solved even for value learning. (Note: this is my takeaway from that discussion, not necessarily the point Eliezer intended.)

That argument convinced me to think some more about MIRI’s old corrigibility results. And... they're not very impressive? Like, people tried a few hacks, and the hacks didn't work. Fully Updated Deference is the only real barrier they found, and I don't think it's that much of a barrier - it mostly just shows that something is wrong with the assumed type-signature of the child agent, which isn't exactly shocking.

(Side note: fully updated deference doesn't seem like that much of a barrier in the grand scheme of things, but it is still a barrier which will probably block whatever your first idea is for achieving corrigibility. There are probably ways around it, but you need to actually find and use those ways around.)

While digging around old writing on the topic, I also found an argument from Eliezer that “corrigibility” is a natural concept:

The "hard problem of corrigibility" is interesting because of the possibility that it has a relatively simple core or central principle - rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.

We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.

Now that sounds like the sort of thing which is potentially useful! Shame that previous attempts to formulate corrigibility started with kinda-ad-hoc desiderata, rather than from an AI building a sub-AI while being prone to various sorts of errors. (Pro tip for theory work: when you’re formalizing a concept, and you have some intuitive argument for why it’s maybe a natural concept, start from that argument!)

So my overall takeaway here is:

  • There’s at least a plausible intuitive argument that corrigibility is A Thing.
  • Previous work on formalizing/operationalizing corrigibility was pretty weak.

So are you targeting corrigibility now?

No. I’ve been convinced that corrigibility is maybe A Thing; my previous reasons for mostly-ignoring it were wrong. I have not been convinced that it is A Thing; it could still turn out not to be.

But the generalizable takeaway is that there are potentially-useful alignment targets which might turn out to be natural concepts (of which corrigibility is one). Which of those targets actually turn out to be natural concepts is partially a mathematical question (i.e. if we can robustly formulate it mathematically then it’s definitely natural), and partially empirical (i.e. if it ends up being a natural concept in an AI’s internal ontology then that works too).

So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.

How has broadening the alignment target changed your day-to-day research?

It hasn’t. The reason is explained in Plans Are Predictions, Not Optimization Targets. Briefly: the main thing I’m working on is becoming generally less confused about how agents work. While doing that, I mostly aim for robust bottlenecks - understanding abstraction, for instance, is robustly a bottleneck for many different approaches (which is why researchers converge on it from many different directions). Because it’s robust, it’s still likely to be a bottleneck even when the target shifts, and indeed that is what happened.

What high-level progress have you personally made in the past year? Any mistakes made or things to change going forward?

In my own work, theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)

As of The Plan, by six months ago I was hoping to have efficient algorithms for computing natural abstractions in simulated environments, and that basically didn’t happen. I did do a couple interesting experiments (which haven’t been written up):

  • Both Jeffery Andrade and I tried to calculate natural abstractions in the Game of Life, which basically did not work.
  • I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.

… but mostly I ended up allocating time to other things. The outputs of those experiments were what I needed for now; I’m back to being bottlenecked on theory. (Which is normal - running a computational experiment and exploring the results in detail takes a few days or maybe a couple of weeks at most, which is far faster than an iteration cycle on theory development, so of course I spend most of my time bottlenecked on theory.)

On the theory side, progress has zoomed along surprisingly quickly despite spending less time on it than I expected as of late last year. The Basic Foundations sequence is the main publicly-visible artifact of that progress so far; behind the scenes I’ve also continued to streamline the math of natural abstraction, and lately I’ve been working to better unify it with thermodynamic-style arguments and phase changes. (In particular, my current working hypothesis is that grokking is literally a phase change in the thermodynamic sense, induced by coupling to the environment via SGD. On that hypothesis, understanding how such coupling-induced phase changes work is the main next step to mapping net-internal structures to natural abstractions in the environment. But that’s the sort of hypothesis which could easily go out the window in another few weeks.) The main high-level update from the theory work is that, while getting abstraction across the theory-practice gap continues to be difficult, basically everything else about agent foundations is indeed way easier once we have a decent working operationalization of abstraction.

So I’ve spent less time than previously expected both on theory and on crossing the theory-practice gap. Where did all that time go?

First, conferences and workshops. I said “yes” to basically everything in the first half of 2022, and in hindsight that was a mistake. Now I’m saying “no” to most conferences/workshops by default.

Second, training people (mostly in the MATS program), and writing up what I’d consider relatively basic intro-level arguments about alignment strategies which didn’t have good canonical sources. In the coming year, I’m hoping to hand off most of the training work; at this point I think we have a scalable technical alignment research training program which at least picks the low-hanging fruit (relative to my current ability to train people). In particular, I continue to be optimistic that (my version of) the MATS program shaves at least 3 years off the time it takes participants to get past the same first few bad ideas which everyone has and on to doing potentially-useful work.

What’s the current status of your work on natural abstractions?

In need of a writeup. I did finally work out a satisfying proof of the maxent form for natural abstractions on Bayes nets, and it seems like every week or two I have an interesting new idea for a way to use it. Writing up the proofs as a paper is currently on my todo list; I’m hoping to nerd-snipe some researchers from the complex systems crowd.
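For readers who haven't met the term: a "maxent form" generically means the distribution in question maximizes entropy subject to a handful of expectation constraints, which forces an exponential-family shape. The specific result for natural abstractions on Bayes nets isn't written up yet, so the block below is only the standard textbook template, not that result:

```latex
% Generic maximum-entropy template (textbook form, not the Bayes-net result itself):
% maximize entropy subject to expectation constraints on feature functions f_i.
\max_{P}\; -\sum_x P(x)\log P(x)
\quad \text{s.t.} \quad \sum_x P(x)\, f_i(x) = c_i \;\;\forall i, \qquad \sum_x P(x) = 1.
% Lagrange multipliers give the exponential-family solution:
P(x) = \frac{1}{Z(\lambda)} \exp\!\Big(\sum_i \lambda_i f_i(x)\Big),
\qquad Z(\lambda) = \sum_x \exp\!\Big(\sum_i \lambda_i f_i(x)\Big).
```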

Getting it across the theory-practice gap remains the next major high-level step. The immediate next step is to work out and implement the algorithms implied by the maxent form.


theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality

I appreciate you flagging this. I read the former sentence and my immediate next thought was the heuristic in the parenthetical sentence.

by six months ago I was hoping to have efficient algorithms for computing natural abstractions in simulated environments

Can you provide an example of what this would look like?

  • Both Jeffery Andrade and I tried to calculate natural abstractions in the Game of Life, which basically did not work.
  • I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.

What are some examples of natural abstractions you were looking for, and how did you calculate or fail to calculate them?


It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods? 

Bah! :D It's sad to hear he's updated away from ambitious value learning towards corrigibility-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

I think the pointer “the thing I would do if I wanted to make a second AI that would be the best one I could make at my given intelligence” is what is being updated in favor of, since this does feel like a natural abstraction, given how many agents would think this. (It also seems very similar to the golden rule: “I will do what I would want a successor AI to do if the successor AI was actually the human’s successor AI”, or “treat others (the human) how I’d like to be treated (by a successor AI)”, abstracted one meta-level upwards.) Whether this turns out to be value learning or something else 🤷. That seems a different question from whether or not it is indeed a natural abstraction.

Interesting. What is it that potentially makes "treat the human like I would like to be treated if I had their values" easier than "treat the human like they would like to be treated"?

John usually does not make his plans with an eye toward making things easier. His plan previously involved values because he thought they were strictly harder than corrigibility. If you solve values, you solve corrigibility. Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.

I don’t know all the details of John’s model here, but it may go something like this: If you solve corrigibility, and then find out corrigibility isn’t sufficient for alignment, you may expect your corrigible agent to help you build your value aligned agent.

Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.

In what way do you think solving abstraction would solve shard theory?

Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there's a lot of degrees of freedom to work with.

Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?

We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.

Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:

  • Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?
  • Still on the "figure out agency and train up an aligned AGI unilaterally" path?
  • Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

I expect there to be no major updates, but seems worthwhile to keep an eye on this.

So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.

I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning.

Primarily, "Do What I Mean" is about translation. Entity 1 compresses some problem specification defined over Entity 1's world-model into a short data structure — an order, a set of values, an objective function, etc. — then Entity 2 uses some algorithm to de-compress that data structure and translate it into a problem specification defined over Entity 2's world-model. The problem of alignment via Do What I Mean, then, is the problem of ensuring that Entity 2 (which we'll assume to be bigger) decompresses a specific type of compressed data structures using the same algorithm that was used to compress them in the first place — i. e., interprets orders the way they were intended/acts on our actual values and not the misspecified proxy/extrapolates our values from the crude objective function/etc.

This potentially has the nice property of collapsing the problem of alignment to the problem of ontology translation, and so unifying the problem of interpreting an NN and the problem of aligning an NN into the same problem.

In addition, it's probably a natural concept, in the sense that "how do I map this high-level description onto a lower-level model" seems like a problem any advanced agent would be running into all the time. There'll almost definitely be concepts and algorithms about that in the AI's world-model, and they may be easily repluggable.
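To make the translation framing above concrete, here is a minimal type-signature sketch (all names and the toy "kitchen" example are hypothetical illustrations, not anything from the post): the DWIM condition is that Entity 2's decompression, composed with Entity 1's compression, recovers the goal Entity 1 intended, re-expressed in Entity 2's ontology.

```python
# Hypothetical type-signature sketch of "Do What I Mean" as ontology translation.
from typing import Any, Callable, Dict

Goal = Dict[str, Any]        # a goal, expressed over some world-model
Spec = str                   # the short compressed artifact: an order, objective, etc.

Compress = Callable[[Goal], Spec]     # Entity 1: goal over ITS world-model -> spec
Decompress = Callable[[Spec], Goal]   # Entity 2: spec -> goal over ITS world-model
Translate = Callable[[Goal], Goal]    # the "intended" ontology map we want to match

def dwim_aligned(compress: Compress, decompress: Decompress,
                 translate: Translate, human_goal: Goal) -> bool:
    """DWIM condition: decompressing the human's spec yields the same goal as
    directly translating the human's goal into the AI's ontology."""
    return decompress(compress(human_goal)) == translate(human_goal)

# Toy usage: the human's ontology has "clean_kitchen"; the AI's is finer-grained.
compress: Compress = lambda g: "clean the kitchen" if g.get("clean_kitchen") else "noop"
translate: Translate = lambda g: {"dishes_washed": g.get("clean_kitchen", False),
                                  "floor_mopped": g.get("clean_kitchen", False)}
decompress: Decompress = lambda s: ({"dishes_washed": True, "floor_mopped": True}
                                    if s == "clean the kitchen"
                                    else {"dishes_washed": False, "floor_mopped": False})
print(dwim_aligned(compress, decompress, translate, {"clean_kitchen": True}))  # True
```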

Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?

Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)

My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better as the models of the last 2-3 years, as the models of the last 2-3 years are compared to pre-transformer models, then I'd expect them to be at least human-level. That does not mean that nets will get to human level immediately after that transformer-level shift comes along; e.g. with transformers it still took ~2-3 years before transformer models really started to look impressive.

So the most important update from deep learning over the past year has been the lack of any transformer-level paradigm shift in algorithms, architectures, etc.

There are of course other potential paths to human-level (or higher) which don't route through a transformer-level paradigm shift in deep learning. One obvious path is to just keep scaling; I expect we'll see a paradigm shift well before scaling alone achieves human-level AGI (and this seems even more likely post-Chinchilla). The main other path is that somebody wires together a bunch of GPT-style AGIs in such a way that they achieve greater intelligence by talking to each other (sort of like how humans took off via cultural accumulation); I don't think that's very likely to happen near-term, but I do think it's the main path by which 5-year timelines would happen without a paradigm shift. Call it maybe 5-10%. Finally, of course, there's always the "unknown unknowns" possibility.

How long until the next shift?

Back around 2014 or 2015, I was visiting my alma mater, and a professor asked me what I thought about the deep learning wave. I said it looked pretty much like all the previous ML/AI hype cycles: everyone would be very excited for a while and make grand claims, but the algorithms would be super finicky and unreliable. Eventually the hype would die down, and we’d go into another AI winter. About ten years after the start of the wave someone would show that the method (in this case large CNNs) was equivalent to some Bayesian model, and then it would make sense when it did/didn’t work, and it would join the standard toolbox of workhorse ML algorithms. Eventually some new paradigm would come along, and the hype cycle would start again.

… and in hindsight, I think that was basically correct up until transformers came along around 2017. Pre-transformer nets were indeed very finicky, and were indeed shown equivalent to some Bayesian model about ten years after the excitement started, at which point we had a much better idea of what they did and did not do well. The big difference from previous ML/AI hype waves was that the next paradigm - transformers - came along before the previous wave had died out. We skipped an AI winter; the paradigm shift came in ~5 years rather than 10-15.

… and now it’s been about five years since transformers came along. Just naively extrapolating from the two most recent data points says it’s time for the next shift. And we haven’t seen that shift yet. (Yes, diffusion models came along, but those don’t seem likely to become a transformer-level paradigm shift; they don’t open up whole new classes of applications in the same way.)

So on the one hand, I'm definitely nervous that the next shift is imminent. On the other hand, it's already very slightly on the late side, and if another 1-2 years go by I'll update quite a bit toward that shift taking much longer.

Also, on an inside view, I expect the next shift to be quite a bit more difficult than the transformers shift. (I don't plan to discuss the reasons for that, because spelling out exactly which technical hurdles need to be cleared in order to get nets to human level is exactly the sort of thing which potentially accelerates the shift.) That inside view is a big part of why my timelines last year were 10-15 years, and not 5. The other main reasons my timelines were 10-15 years were regression to the mean (i.e. the transformers paradigm shift came along very unusually quickly, and it was only one data point), general hype-wariness, and an intuitive sense that unknown unknowns in this case will tend to push toward longer timelines rather than shorter on net.

Put all that together, and there's a big blob of probability mass on ~5 year timelines; call that 20-30% or so. But if we get through the next couple years without a transformer-level paradigm shift, and without a bunch of wired-together GPTs spontaneously taking off, then timelines get a fair bit longer, and that's where my median world is.

Still on the "figure out agency and train up an aligned AGI unilaterally" path?

"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.

One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is too dysfunctional as a group to implement the plan even if all the individuals wanted to, because that is how approximately 100% of organizations of more than 10 people operate.) In cognitive terms, the plan is pretending that lots of other people's actions are choosable/controllable, when in fact those other people's actions are not choosable/controllable, at least relative to the planner's actual capabilities.

The simplest and most robust counter to this failure mode is to always make unilateral plans.

But to counter the failure mode, plans don't need to be completely unilateral. They can involve other people doing things which those other people will actually predictably do. So, for instance, maybe I'll write a paper about natural abstractions in hopes of nerd-sniping some complex systems theorists to further develop the theory. That's fine; the actions which I need to counterfact over in order for that plan to work are actions which I can in fact take unilaterally (i.e. write a paper). Other than that, I'm just relying on other people acting in ways in which they'll predictably act anyway.

Point is: in order for a plan to be a "real plan" (as opposed to e.g. a fabricated option, or a de-facto applause light), all of the actions which the plan treats as "under the planner's control" must be actions which can be taken unilaterally. Any non-unilateral actions need to be things which we actually expect people to do by default, not things we wish they would do.

Coming back to the question: my plans certainly do not live in some children's fantasy world where one or more major AI labs magically become the least-dysfunctional multiple-hundred-person organizations on the planet, and then we all build an aligned AGI via the magic of Friendship and Cooperation. The realistic assumption is that large organizations are mostly carried wherever the memetic waves drift. Now, the memetic waves may drift in a good direction - if e.g. the field of alignment does indeed converge to a paradigm around decoding the internal language of nets and expressing our targets in that language, then there's a strong chance the major labs follow that tide, and do a lot of useful work. And I do unilaterally have nonzero ability to steer that memetic drift - for instance, by creating public knowledge of various useful lines of alignment research converging, or by training lots of competent people.

That's the sort of non-unilaterality which I'm fine having in my plans: relying on other people to behave in realistic ways, conditional on me doing things which I can actually unilaterally do.

Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

Basically no.

I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning. ...

I basically buy your argument, though there's still the question of how safe a target DWIM is.

A couple of weeks ago I started blitzing my way through one of your posts on natural abstraction and, wham! it hit me: J.J. Gibson, ecological psychology. Are you familiar with that body of work? Gibson's idea was that the environment has affordances (he's the one who brought that word to prominence) which are natural "points of attachment" [my phrase] for perceptual processes. It seems to me that his affordances are the low-dimensional projections (or whatever) that are the loci of your natural abstractions. Gibson didn't have the kind of mathematical framework you're interested in, though I have the vague sense that some people who've been influenced by him have worked with complex dynamics.

And then there's the geometry of meaning Peter Gärdenfors has been developing: Conceptual Spaces, MIT 2000 and The Geometry of Meaning, MIT 2014. He argues that natural language semantics is organized into very low dimensional conceptual spaces. Might have some clues of things to look for.

If I want to know more about these two things, which papers/books should I read?

Hmmm... On Gibson, I'd read his last book, The Ecological Approach to Visual Perception (1979). I'd also look at his Wikipedia entry. You might also check out Donald Norman, a cognitive psychologist who adapted Gibson's ideas to industrial design while at Apple and then as a private consultant.

On Gärdenfors the two books are good. You should start with the 2000 book. But you might want to look at an article first: Peter Gärdenfors, "An Epigenetic Approach to Semantic Categories," IEEE Transactions on Cognitive and Developmental Systems, 12(2), June 2020, 139-147. DOI: 10.1109/TCDS.2018.2833387 (sci-hub link, https://sci-hub.tw/10.1109/TCDS.2018.2833387). Here's a video of a recent talk, Peter Gärdenfors: Conceptual Spaces, Cognitive Semantics and Robotics: https://youtu.be/RAAuMT-K1vw

What sort of value do you expect to get out of "crossing the theory-practice gap"?

Do you think that this will result in better insights about which direction to focus in during your research, for example? 

Some general types of value which are generally obtained by taking theories across the theory-practice gap:

  • Finding out where the theory is wrong
  • Direct value from applying the theory
  • Creating robust platforms upon which further tools can be developed

Will we have to wait until Dec 2023 for the next update or will the amount of time until the next one halve for each update, 6 months then 3 months then 6 weeks then 3 weeks?

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4

Did you not talk to Eliezer (or Stuart or Paul or...) about Corrigibility before the conversation you cited? It seems like they should have been able to change your mind quite easily on the topic, from what you wrote.

Have you done any work on thermodynamic coupling inducing phase transitions? If not, I'd recommend looking into using a path integral formulation to frame the issue. David Tong's notes are a good introduction to the topic. Feynman's book on path integrals serves as a great refresher, with a couple of good chapters on probability theory and thermodynamic coupling. I lost my other reference texts, so I can't recommend anything else off the top of my head.

Nice update!

On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.

While I don't think of these as alignment targets per se (as I understand the term to be used), I strongly support discussing the internal language of the neural net and moving away from convoluted inner/outer schemes.

On the theoretical side, Paul Christiano, Scott Garrabrant, and I had all basically converged to working on roughly the same problem (abstraction, ontology identification, whatever you want to call it) by early 2022.

I haven't been following Scott's work that closely; what part of it are you calling "roughly the same problem?" (The titles of his recent LW posts don't seem to contain relevant words.)

I don't think Scott's published much on the topic yet. It did come up in this comment thread on The Plan, and I've talked to him in-person about the topic from time to time. We haven't synced up in a while, so I don't know what he's up to the last few months.

A question about alignment via natural abstractions (if you've addressed it before, please refer me to where): it seems to me plausible that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional all-or-nothing property. Like, the AI will learn about "trees", but whether it avoids unintentionally killing everyone depends on whether a palm tree is a tree, or on whether a copse counts as full of trees, or some other question which depends on unnatural details of the natural abstraction.

  • Do you think that edge cases will just naturally be correctly learned?
  • Do you think that edge cases just won't end up mattering for alignment?

Definitions, as we usually use them, are not the correct data structure for word-meaning. Words point to clusters in thing-space; definitions try to carve up those clusters with something like cutting-planes. That's an unreliable and very lossy way to represent clusters, and can't handle edge-cases well or ambiguous cases at all. The natural abstractions are the clusters (more precisely the summary parameters of the clusters, like e.g. cluster mean and variance in a gaussian cluster model); they're not cutting-planes.
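To make the clusters-vs-cutting-planes contrast concrete, here is a toy sketch on synthetic 2-D data (illustrative only, not the actual natural-abstraction machinery): a Gaussian mixture keeps the cluster summaries (means, covariances) and gives graded membership on edge cases, while a linear classifier can only commit to a hard boundary.

```python
# Toy contrast between cluster-summary and cutting-plane representations of a
# concept, on synthetic 2-D data. The mixture keeps means/covariances and gives
# graded membership; the linear classifier only gives a hard boundary.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))
cluster_b = rng.normal(loc=[4, 4], scale=1.0, size=(200, 2))
X = np.vstack([cluster_a, cluster_b])
y = np.array([0] * 200 + [1] * 200)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
plane = LogisticRegression().fit(X, y)

edge_case = np.array([[2.0, 2.0]])  # halfway between the clusters
print("cluster means:\n", gmm.means_)
print("soft membership of edge case:", gmm.predict_proba(edge_case))  # roughly 50/50
print("cutting-plane verdict:", plane.predict(edge_case))             # forced to pick a side
```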

I don't think "definitions" are the crux of my discomfort. Suppose the model learns a cluster; the position, scale, and shape parameters of this cluster summary are not perfectly stable--that is, they vary somewhat with different training data. This is not a problem on its own, because it's still basically the same; however, the (fuzzy) boundary of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). This means that there are many cutting planes, induced by actions to be taken downstream of the model, on which training on different data could have yielded a different result. My intuition is that most of the risk of misalignment arises at those boundaries:

  • One reason for my intuition is that in communication between humans, difficulties arise in a similar way (i.e. when two people's clusters have slightly different shapes)
  • One reason is that the boundary cases feel like the kind of stuff you can't reliably learn from data or effectively test.

Your comment seems to be suggesting that you think the edge cases won't matter, but I'm not really understanding why the fuzzy nature of concepts makes that true.

seems like maybe the naturalness of abstracting a cluster is the disagreement in ensemble of similar-or-shorter-length equivalent models? if your abstraction is natural enough, it always holds. if it's merely approaching the natural abstraction, it'll approximate it. current artificial neural networks are probably not strong enough to learn all the relevant natural abstractions to a given context, but they move towards them, some of the way.

is yudkowsky's monster a claim about the shape of the most natural abstractions? perhaps what we really need is to not assume we know which abstractions have been confirmed to be acceptably natural. ie, perhaps his body of claims about this boils down to "oh maybe soft optimization is all one could possibly ask for until all matter is equally superintelligent, so that we don't break decision theory and all get taken over by a selfish [?genememetic element?]" or something funky along those lines.

[??] I don't have a term of art that generalizes these things properly; genes/memes/executable shape fragments in generality

I’m interested in why the transformer architecture has been so successful and the concept of natural abstraction is useful here. I’m not thinking so much about how transformers work, not in any detail, but about the natural environment in which the architecture was originally designed to function, text. What are the natural abstractions over text?

Let’s look at this, a crucial observation:

That convergence isn’t complete yet - I think a lot of the interpretability crowd hasn’t yet fully internalized the framing of “interpretability is primarily about mapping net-internal structures to corresponding high-level interpretable structures in the environment”. In particular I think a lot of interpretability researchers have not yet internalized that mathematically understanding what kinds of high-level interpretable structures appear in the environment is a core part of the problem of interpretability.

For a worm the external environment is mostly dirt, but other worms as well. Birds, a rather different natural environment. Then we have the great apes, still a different natural environment. 

Our natural environment is very much like that of the great apes, our close biological relatives. Our fellow conspecifics are a big part of that environment. But we have language as well. Language is part of our natural environment, spoken language, but text as well. The advent of writing has made it possible to immerse ourselves in a purely textual world. Arithmetic, too, is a textual world. Think of it as a very specialized form of language, with a relatively small set of primitive elements and a highly constrained syntax. And then there is program code.

You see where this is going?

For the transformer architecture, text is its natural environment. The attention mechanism allows it to get enough context so that it can isolate the natural abstractions in an environment of pure text. What’s in that environment? Language, whether spoken or written, is, after all, a physical thing. Alphanumeric characters, spaces between strings of characters, punctuation, capitalization, paragraphing conventions. Stuff like that. It’s the transformer’s superior capacity to abstract over that environment that has made it so successful.

(I note, as an aside, that speech is physically complex and requires its own natural abstractions for perception. Phoneticians study the physical features of the soundstream itself while phonologists study those aspects of the signal that are linguistically relevant. They're thus looking for the natural abstractions over the physical signal itself. When put online, the physical text is extraordinarily simple – strings of ASCII characters – so it requires no sophisticated perceptual mechanisms.)

Note, of course, that text is text. Ultimately textual meanings have to be grounded in the physical world. No matter how much text a transformer consumes, it’s unlikely to substitute for direct access to the natural world. So the architecture can’t do everything, but it can do a lot.

I’ve recently written two posts examining the output of ChatGPT for higher-level discourse structure, mostly the conventions of dialog. It’s simple stuff, obvious in a way. But it has all but convinced me that ChatGPT is picking up a primitive kind of discourse grammar that is separate from and independent of the word-by-word grammar of individual sentences. I can’t see how it would be able to produce such fluent text if it weren’t doing that.

The posts: 

Of pumpkins, the Falcon Heavy, and Groucho Marx: High-Level discourse structure in ChatGPT

High level discourse structure in ChatGPT: Part 2 [Quasi-symbolic?]

Curated! To ramble a bit on why: I love how this post makes me feel like I have a good sense of what John has been up to, been thinking about, and why; the insight of asking "how would an AI ensure a child AI is aligned with it?" feels substantive; and the optimism is nice and doesn't seem entirely foolhardy. Perhaps most significantly, it feels to me like a very big deal if alignment is moving towards something paradigmatic (shared models and assumptions and questions and methods). I had thought that was something we weren't going to get, but John does point out that many people are converging on similar interpretability/abstraction targets, and now that he does point it out, that seems true and hopeful. I'm not an alignment researcher myself, so I don't put too much stock in my assessment, but this update is one of the most hopeful things I've read any time recently.

Forgive me if the answer to this would be obvious given more familiarity with natural abstractions, but is your claim that interpretability research should identify mathematically defined high-level features rather than fuzzily defined features? Supposing that in optimistic versions of interpretability, we're able to say that this neuron corresponds to this one concept and this one circuit in the network is responsible for this one task (and we don't have to worry about polysemanticity). How do we define concepts like "trees" and "summarizing text in a way that labelers like" in a mathematical way?

So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.

Do you expect that the network will have an accurate understanding of its goals? I'd expect that we could train an agentic language model which is still quite messy and isn't able to reliably report information about itself and even if it could, it probably wouldn't know how to express it mathematically. I think a model could be able to write a lot of text about human values and corrigibility and yet fail to have a crisp or mathematical concept for either of them.

Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.

I like this. In fact, I would argue that some of those medium-term alignment targets are actually necessary stepping stones toward ambitious value learning.

Human mimicry, for one, could serve as a good behavioral prior for IRL agents. AI that can reverse-engineer the policy function of a human (e.g., by minimizing the error between the world-state-trajectory caused by its own actions and that produced by a human's actions) is probably already most of the way there toward reverse-engineering the value function that drives it (e.g., start by looking for common features among the stable fixed points of the learned policy function). I would argue that the intrinsic drive to mimic other humans is a big part of why humans are so adept at aligning to each other.
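As a minimal sketch of that mimicry step (toy data throughout, and a deliberately simplified action-matching variant rather than the trajectory-matching described above): fit a policy to a demonstrator's state-action pairs by minimizing prediction error, as a starting point for later recovering the value function behind it.

```python
# Toy behavioral-cloning sketch of the mimicry idea (everything here is hypothetical):
# fit a policy to a human demonstrator's state -> action data by minimizing
# action-prediction error; an IRL-style step could then search for a reward
# function under which the recovered policy is (near-)optimal.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))                               # observed states
human_actions = states @ np.array([0.5, -1.0, 0.2, 0.0]) + 0.1   # demonstrator's policy

policy = LinearRegression().fit(states, human_actions)
print("recovered policy weights:", policy.coef_)   # matches the demonstrator's weights
```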

Do What I Mean (DWIM) would also require modeling humans in a way that would help greatly in modeling human values. A human that gives an AI instructions is mapping some high-dimensional, internally represented goal state into a linear sequence of symbols (or a 2D diagram or whatever). DWIM would require the AI to generate its own high-dimensional, internally represented goal states, optimizing for goals that give a high likelihood to the instructions it received. If achievable, DWIM could also help transform the local incentives for general AI capabilities research into something with a better Nash equilibrium. Systems that are capable of predicting what humans intended for them to do could prove far more valuable to existing stakeholders in AI research than current DL and RL systems, which tend to be rather brittle and prone to overfitting to the heuristics we give them.

Post summary (feel free to suggest edits!):
Last year, the author wrote up a plan that they gave a “better than 50/50 chance” of working before AGI kills us all. This predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.

They believe this is on track and give 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability on the experimental side, in addition to theoretical work. This could lead to identifying which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them. They think we should then focus on those.

In their personal work, they’ve found theory work faster than expected, and crossing the theory-practice gap mildly slower. In 2022 most of their time went into theory work like the Basic Foundations sequence, workshops and conferences, training others, and writing up intro-level arguments on alignment strategies.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

You might mention that the prediction about the next 1-2 years is only at 40% confidence, and the "8-year" part is an inside view whose corresponding outside view estimate is more like 10-15 years.

Cheers, edited :)

I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.

In the interest of making my abstract intuition here more precise, a few weird questions:

Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.

What does your picture of (realistically) ideal outcomes from theory work look like? Is it more giving interpretability researchers a better frame to reason under (like a more mathematical notion of optimization that we have to figure out how to detect in large nets against adversaries) or something even more ambitious that designs theoretical interpretability processes that Just Work, leaving technical legwork (what ELK seems like to me)?

While they definitely share the core idea of ontology mismatch, it feels like the approaches are pretty different in that you prioritize mathematical definitions a lot and ARC's approach is more heuristic. Do you think the mathematical stuff is necessary for sufficient deconfusion, or just a pretty tractable way to arrive at the answers we want?

We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.

I'm not really convinced that even if corrigibility is A Thing (I agree that it's plausible it is, but I think it could also just be trivially part of another Thing given more clarity), it's as good as other medium-term targets. Corrigibility as stated doesn't feel like it covers a large chunk of the likely threat models, and a broader definition seems like it's just rephrasing a bunch of the stuff from Do What I Mean or inner alignment. What am I missing about why it might be as good a target?

I'd like to offer a general observation about ontology. As far as I can tell, the concept entered computer science through Old School work in symbolic computation in AI. So you want to build a system that can represent all of human knowledge? OK. What are the primitive elements of such a system? What objects, events, and processes, along with the relations between them, do you need? That's your ontology. From there you generalize to any computing system: what are the primitives and what can you construct from them?

If you want to take a peek at the Old School literature, John Sowa offers one view. Note that this is not a general survey of the literature. It's one man's distillation of it. Sowa worked at IBM Research (at Armonk, I believe) for years.

I was interested in the problem, and still am, and did a variety of work. One of the things I did was write a short article on the "Ontology of Common Sense" for a Handbook of Metaphysics and Ontology which you can find here:

The opening three paragraphs:

The ontology of common sense is the discipline which seeks to establish the categories which are used in everyday life to characterize objects and events. In everyday life steel bars and window panes are solid objects. For the scientist, the glass of the window pane is a liquid, and the solidity of both the window pane and the steel bar is illusory, since the space they occupy consists mostly of empty regions between the sub-atomic particles which constitute these objects. These facts, however, have no bearing on the ontological categories of common sense. Sub-atomic particles and solid liquids do not exist in the domain of common sense. Common sense employs different ontological categories from those used in the various specialized disciplines of science.

Similar examples of differences between common sense and scientific ontologies can be multiplied at will. The common sense world recognizes salt, which is defined in terms of its colour, shape, and, above all, taste. But the chemist deals with sodium chloride, a molecule consisting of sodium and chlorine atoms; taste has no existence in this world. To common sense, human beings are ontologically distinct from animals; we have language and reason, animals do not. To the biologist there is no such distinction; human beings are animals; language and reason evolved because they have survival value. Finally, consider the Morning Star and the Evening Star. Only by moving from the domain of common sense to the domain of astronomy can we assert that these stars are not stars at all, but simply different manifestations of the planet Venus.

In all of these cases the common sense world is organized in terms of one set of object categories, predicates, and events while the scientific accounts of the same phenomena are organized by different concepts. In his seminal discussion of natural kinds, Quine suggested that science evolves by replacing a biologically innate quality space, which gives rise to natural kinds (in our terms, the categories of a common sense ontology), with new quality spaces. However, Quine has little to say about just how scientific ontology evolved from common sense ontology.

I suspect that there's a lot of structure between raw sensory experience and common sense ontology and a lot more between that and the ontologies of various scientific disciplines. But, you know, I wouldn't be surprised if a skilled auto mechanic has their own ontology of cars, a lot of it primarily non-verbal and based on the feels and sounds of working on cars with your hands.

Here are my references, with brief annotations, which indicate something of the range of relevant work that's been done in the past:

Berlin, B., Breedlove, D., Raven, P. 1973. "General Principles of Classification and Nomenclature in Folk Biology," American Anthropologist, 75, 214 - 242. There's been quite a lot of work on folk taxonomy. In some ways it's parallel to the (more) formal taxonomies of modern biology. But there are differences as well.

Hayes, P. J. 1985. "The Second Naive Physics Manifesto," in Formal Theories of the Commonsense World, J. R. Hobbs and R. C. Moore, eds., Ablex Publishing Co., 1 - 36. A lot of work has been done in this area, including work on college students who may have ideas about Newtonian dynamics in their heads but play video games in a more Aristotelian way.

Keil, F. C. 1979. Semantic and Conceptual Development: An Ontological Perspective, Cambridge, Massachusetts and London, England: Harvard University Press. How children develop concepts.

Quine, W. V. 1969. "Natural Kinds," in Essays in Honor of Carl G. Hempel, Nicholas Rescher et al., eds., D. Reidel Publishing Co., 5 - 23. That is to say, are there natural kinds or is it culture all the way down.

Rosch, E. et al. 1976. "Basic Objects in Natural Categories," Cognitive Psychology, 8, 382 - 439. A key text introducing something called prototype theory.

Sommers, F. 1963. "Types and Ontology," Philosophical Review, 72, 327 - 363. Do you know what philosophers mean by a category mistake? This is about the logic behind them.