Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?
When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there's a lot of degrees of freedom to work with.
Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?
We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.
Some general types of value which are obtained by taking theories across the theory-practice gap:
Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?
I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning. ...
I basically buy your argument, though there's still the question of how safe a target DWIM is.
Still on the "figure out agency and train up an aligned AGI unilaterally" path?
"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.
One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is too dysfunctional as a group to implement the plan even if all the individuals wanted to, because that is how approximately 100% of organizations of more than 10 people operate.) In cognitive terms, the plan is pretending that lots of other people's actions are choosable/controllable, when in fact those other people's actions are not choosable/controllable, at least relative to the planner's actual capabilities.
The simplest and most robust counter to this failure mode is to always make unilateral plans.
But to counter the failure mode, plans don't need to be completely unilateral. They can involve other people doing things which those other people will actually predictably do. So, for instance, maybe I'll write a paper about natural abstractions in hopes of nerd-sniping some complex systems theorists to further develop the theory. That's fine; the actions which I need to counterfact over in order for that plan to work are actions which I can in fact take unilaterally (i.e. write a paper). Other than that, I'm just relying on other people acting in ways in which they'll predictably act anyway.
Point is: in order for a plan to be a "real plan" (as opposed to e.g. a fabricated option, or a de-facto applause light), all of the actions which the plan treats as "under the planner's control" must be actions which can be taken unilaterally. Any non-unilateral actions need to be things which we actually expect people to do by default, not things we wish they would do.
Coming back to the question: my plans certainly do not live in some children's fantasy world where one or more major AI labs magically become the least-dysfunctional multiple-hundred-person organizations on the planet, and then we all build an aligned AGI via the magic of Friendship and Cooperation. The realistic assumption is that large organizations are mostly carried wherever the memetic waves drift. Now, the memetic waves may drift in a good direction - if e.g. the field of alignment does indeed converge to a paradigm around decoding the internal language of nets and expressing our targets in that language, then there's a strong chance the major labs follow that tide, and do a lot of useful work. And I do unilaterally have nonzero ability to steer that memetic drift - for instance, by creating public knowledge of various useful lines of alignment research converging, or by training lots of competent people.
That's the sort of non-unilaterality which I'm fine having in my plans: relying on other people to behave in realistic ways, conditional on me doing things which I can actually unilaterally do.
Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?
Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)
My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI, things foom relatively quickly.) That comes from an intuitive extrapolation: if something came along which improved on the models of the last 2-3 years by about as much as those models improved on pre-transformer models, then I'd expect it to be at least human-level. That does not mean that nets will get to human level immediately after that transformer-level shift comes along; e.g. with transformers it still took ~2-3 years before transformer models really started to look impressive.
So the most important update from deep learning over the past year has been the lack of any transformer-level paradigm shift in algorithms, architectures, etc.
There are of course other potential paths to human-level (or higher) which don't route through a transformer-level paradigm shift in deep learning. One obvious path is to just keep scaling; I expect we'll see a paradigm shift well before scaling alone achieves human-level AGI (and this seems even more likely post-Chinchilla). The main other path is that somebody wires together a bunch of GPT-style AGIs in such a way that they achieve greater intelligence by talking to each other (sort of like how humans took off via cultural accumulation); I don't think that's very likely to happen near-term, but I do think it's the main path by which 5-year timelines would happen without a paradigm shift. Call it maybe 5-10%. Finally, of course, there's always the "unknown unknowns" possibility.
Back around 2014 or 2015, I was visiting my alma mater, and a professor asked me what I thought about the deep learning wave. I said it looked pretty much like all the previous ML/AI hype cycles: everyone would be very excited for a while and make grand claims, but the algorithms would be super finicky and unreliable. Eventually the hype would die down, and we’d go into another AI winter. About ten years after the start of the wave someone would show that the method (in this case large CNNs) was equivalent to some Bayesian model, and then it would make sense when it did/didn’t work, and it would join the standard toolbox of workhorse ML algorithms. Eventually some new paradigm would come along, and the hype cycle would start again.
… and in hindsight, I think that was basically correct up until transformers came along around 2017. Pre-transformer nets were indeed very finicky, and were indeed shown equivalent to some Bayesian model about ten years after the excitement started, at which point we had a much better idea of what they did and did not do well. The big difference from previous ML/AI hype waves was that the next paradigm - transformers - came along before the previous wave had died out. We skipped an AI winter; the paradigm shift came in ~5 years rather than 10-15.
… and now it’s been about five years since transformers came along. Just naively extrapolating from the two most recent data points says it’s time for the next shift. And we haven’t seen that shift yet. (Yes, diffusion models came along, but those don’t seem likely to become a transformer-level paradigm shift; they don’t open up whole new classes of applications in the same way.)
So on the one hand, I'm definitely nervous that the next shift is imminent. On the other hand, it's already very slightly on the late side, and if another 1-2 years go by I'll update quite a bit toward that shift taking much longer.
Also, on an inside view, I expect the next shift to be quite a bit more difficult than the transformers shift. (I don't plan to discuss the reasons for that, because spelling out exactly which technical hurdles need to be cleared in order to get nets to human level is exactly the sort of thing which potentially accelerates the shift.) That inside view is a big part of why my timelines last year were 10-15 years, and not 5. The other main reasons my timelines were 10-15 years were regression to the mean (i.e. the transformers paradigm shift came along very unusually quickly, and it was only one data point), general hype-wariness, and an intuitive sense that unknown unknowns in this case will tend to push toward longer timelines rather than shorter on net.
Put all that together, and there's a big blob of probability mass on ~5 year timelines; call that 20-30% or so. But if we get through the next couple years without a transformer-level paradigm shift, and without a bunch of wired-together GPTs spontaneously taking off, then timelines get a fair bit longer, and that's where my median world is.
My own responses to OpenAI's plan:
These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI's plan, but I think they cover the most egregious issues.
I'm writing a 1-year update for The Plan. Any particular questions people would like to see me answer in there?
I'd add that everything in this post is still relevant even if the AGI in question isn't explicitly modelling itself as being in a simulation, attempting to deceive human operators, etc. The more general takeaway of the argument is that certain kinds of distribution shift will occur between training and deployment - e.g. a shift to a "large reality": a universe which embeds the AI and has simple physics, etc. Those distribution shifts potentially make training behavior a bad proxy for deployment behavior, even in the absence of explicit malign intent on the AI's part toward its operators.
One subtlety which I'd expect is relevant here: when two singular vectors have approximately the same singular value, the two vectors are very numerically unstable (within their span).
Suppose that two singular vectors have the same singular value $s_1$. Then in the SVD $M = \sum_i u_i s_i v_i^T$, we have two terms of the form

$$u_1 s_1 v_1^T + u_2 s_1 v_2^T = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} s_1 & 0 \\ 0 & s_1 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}$$

(where the $u_i$'s and $v_i$'s are column vectors). That middle part is just the shared singular value $s_1$ times a 2x2 identity matrix:

$$\begin{bmatrix} s_1 & 0 \\ 0 & s_1 \end{bmatrix} = s_1 I$$

But the 2x2 identity matrix can be rewritten as a 2x2 rotation $R$ times its inverse $R^T$:

$$s_1 I = s_1 R R^T$$

... and then we can group $R$ and $R^T$ with $U$ and $V$, respectively, to rotate the singular vectors:

$$\begin{bmatrix} u_1 & u_2 \end{bmatrix} s_1 R R^T \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} = \left( \begin{bmatrix} u_1 & u_2 \end{bmatrix} R \right) s_1 \left( \begin{bmatrix} v_1 & v_2 \end{bmatrix} R \right)^T$$

Since the rotated singular vectors $\begin{bmatrix} u_1 & u_2 \end{bmatrix} R$ and $\begin{bmatrix} v_1 & v_2 \end{bmatrix} R$ still have orthonormal columns, the end result is another valid singular value decomposition of the same matrix.
Upshot: when a singular value is repeated, the singular vectors are defined only up to a rotation (where the dimension of the rotation is the number of repeats of the singular value).
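To make this concrete, here's a small numpy sketch (my own illustration, not from the original comment): we build a matrix with a repeated singular value, rotate the corresponding pair of singular vectors by an arbitrary 2x2 rotation, and verify that the rotated factors reconstruct the exact same matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 4x4 matrix with a repeated singular value (3.0 appears twice).
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
S = np.diag([3.0, 3.0, 1.0, 0.5])
M = U @ S @ V.T

# Rotate the first two left and right singular vectors by the same
# (arbitrary) 2x2 rotation R, embedded in a 4x4 identity.
theta = 0.7
R2 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
R = np.eye(4)
R[:2, :2] = R2

# Because R commutes with the repeated block of S (which is 3.0 * I there),
# (U R) S (V R)^T is the same matrix: another valid SVD of M.
M_rotated = (U @ R) @ S @ (V @ R).T
print(np.allclose(M, M_rotated))  # True
```

The rotated factors `U @ R` and `V @ R` still have orthonormal columns, so this really is a second, equally valid SVD of the same matrix.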
What this means practically/conceptually is that, if two singular vectors have very close singular values, then a small amount of noise in the matrix will typically "mix them together". So for instance, the post shows a plot of singular vectors for the OV matrix, and a whole bunch of the singular values are very close together. Conceptually, that means the corresponding singular vectors are probably all "mixed together" to a large extent. Insofar as they all have roughly the same singular value, the individual singular vectors are underdefined/unstable; what's fully specified is the span of the singular vectors sharing that singular value.
(In fact, for the singular value distribution shown for the OV matrix in the post, nearly all the singular values are either approximately 10 or approximately 0. So that particular matrix is approximately a projection matrix (scaled by ~10), and the spans of the singular vectors on either side give the spaces projected from/to.)
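A quick numerical sketch of the mixing effect (my own illustration, under assumed toy parameters: a 3x3 matrix with two nearly-equal singular values, plus small Gaussian noise): the individual top singular vectors can shift under the noise, but the 2D span they live in barely moves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Matrix with two nearly-equal singular values (2.0001 and 2.0)
# and one well-separated small one (0.1).
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
M = U @ np.diag([2.0001, 2.0, 0.1]) @ V.T

# Noise much smaller than the large singular values,
# but comparable to the tiny gap between them.
noise = 1e-4 * rng.standard_normal((3, 3))
U0, s0, Vt0 = np.linalg.svd(M)
U1, s1, Vt1 = np.linalg.svd(M + noise)

# The span of the top two left singular vectors is stable:
# the principal-angle cosines between the two 2D spans are ~1.
overlap = np.linalg.svd(U0[:, :2].T @ U1[:, :2], compute_uv=False)
print(overlap)

# But an individual top singular vector can rotate noticeably
# within that span (this dot product is often visibly below 1).
print(abs(U0[:, 0] @ U1[:, 0]))
```

The stable quantity here is the subspace, exactly as the argument above predicts: interpretability claims should attach to the span of near-degenerate singular vectors, not to any one of them.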
One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.
(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)