I do not see you as failing to be a team player re: existential risk from AI.
I do see you as something like ... making a much larger update on the bias toward simple functions than I do. Like, it feels vaguely akin to ... when someone quotes Ursula K. Le Guin's opinion as if that settles some argument with finality?
I think the bias toward simple functions matters, and is real, and is cause for marginal hope and optimism, but "bias toward" feels insufficiently strong for me to be like "ah, okay, then the problem outlined above isn't actually a problem."
I do not, to be clear, believe that my essay contains falsehoods that become permissible because they help idiots or children make inferential leaps. I in fact thought the things that I said in my essay were true (with decently high confidence), and I still think that they are true (with slightly reduced confidence downstream of stuff like the link above).
(You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior; if I do anything remotely like that I will headline explicitly that that's what I'm doing, with words like "The following is a lie, but if you pretend it's true for a minute you might have a true insight downstream of it.")
(That link should take you to the subheading "Written April 2, 2022.")
I think that we don't know what teal shape to draw, and that drawing the teal shape perfectly would not be sufficient on its own. In future writing I'll try to tease those two threads a little further apart.
LW is giving me issues and I'm having a hard time getting to and staying on the page to reply; I don't know how good my further engagement will be, as a result.
if we don't know how to choose the right data, the network might not generalize the way we want
I want to be clear that I think the only sane prior is on "we don't know how to choose the right data." Like, I don't think this is reasonably an "if." I think the burden of proof is on "we've created a complete picture and constrained all the necessary axes," à la cybersecurity, and that the present state of affairs with regards to LLM misalignment (and all the various ways that it keeps persisting/that things keep squirting sideways) bears this out. The claim is not "impossible/hopeless," but "they haven't even begun to make a case that would be compelling to someone actually paying attention."
(iiuc, people like Paul Christiano, who are far more expert than me and definitely qualify as "actually paying attention," find the case more plausible/promising, not compelling. I don't know of an intellectual with grounded expertise whom I respect who is like "we're definitely good, here, and I can tell you why in concrete specifics." The people who are confident are clearly hand-waving, and the people who are not hand-waving are at best tentatively optimistic. re: but your position is hand-wavey, too, Duncan—I think a) much less so, and b) burden of proof should be on "we know how to do this safely" not "exhaustively demonstrate that it's not safe.")
I am interested in an answer to Joe's reply, which seems to me like the live conversational head.
MIRI staff member, modestly senior (but not a technical researcher); this conversation was flagged to my attention in a work Slack message
The take I'm about to offer is my own; iirc it has not been seen or commented on by either Nate or Eliezer, and my shoulder-copies of them are lukewarm about it at best.
Nevertheless I think it is essentially true and correct, and likely at least mostly representative of "the MIRI position" insofar as any single coherent one exists; I would expect most objections to what I'm about to say to be more along the lines of "eh, this is misleading in X or Y way, or will likely imply to most readers some A or B that I don't think is true, or puts its emphasis on M when the true problem is N" as opposed to "what? Wrong."
But all things considered, it still seems better to try to speak a little bit on MIRI's behalf, here, rather than pretending that I think this is "just my take" or giving back nothing but radio silence. Grains of salt all around.
The main reason why the selective breeding objection seems to me to be false is something like "tiger fur coloration" + "behavior once outside of the training environment." I'll be drawing heavily on this previous essay, which also says a bunch of other stuff.
Tigers are bright orange. They're not quite as incredibly visible in the jungle as one might naively think if one considers orange-on-green-foliage; in fact their striping does a lot of work even for human color vision.
But nevertheless, the main selection pressure behind their coloration came from prey animals that do a poor job of distinguishing oranges and reds from greens and browns. The detection algorithms the tigers needed to evade don't care about how visible bright orange is to humans, just about whether the overall gestalt works on deer/gazelles/antelopes/etc.
In other words: the selective breeding (in this case accomplished by natural selection rather than intentional selection, but still) produced a result that "squirted sideways" on an axis that was not being monitored and not itself under selection pressure.
Analogously: We should expect the evolution of AI systems to be responsive to selection pressures imposed upon them by researchers, but we should also expect them to be responsive only on the axes we actually know how to put under pressure. We do not have the benefit that a breeding program run on dogs or humans has, of starting from a core creature that's already "pinned down," with known core traits and variation that gets laid down in a fairly predictable manner. There's only so far you can "stretch," if you're taking single steps at a time from the starting point of "dog" or "human."
Modern AIs are much more blue-sky. They're much more unconstrained. We already have no fucking idea what's going on under the hood or how they are doing what they are doing, cf. that time on twitter that some rando was like "the interpretability guys are on it" and the actual interpretability guys showed up to say "no, we are hopelessly behind."
We have systems that are doing who knows what, in a manner that is way more multidimensional and unconstrained than dogs or humans, and we're applying constrictions to those systems, and there is very little (no?) reason to anticipate that those constrictions are sufficient/hit all of the relevant axes of variance. We don't know where our processes are "colorblind," and leaving tigers with bright orange fur, and we don't know when we'll hit scenarios in which the orange-ness of the tigers' fur suddenly becomes strategically relevant.
To say the same thing in a different way:
When making an AI like Claude, approximately the best that we can do is to select the systems which output more-or-less desired behavior in the training and testing environments. We try everything, and then we cherry-pick the stuff that works.
(This is a little bit oversimplified, but that’s the gist—we don’t actually know how to make a large language model “want” to be nice to people, or avoid telling them how to build bombs. We just know that we can pick and prune and clone from the ones that just so happen to tend to be nicer and less terroristic.)
(And then put patches on them after the fact, when they’re released into the wild and people inevitably figure out how to get them to say the n-word anyway.)
There is no point in the process at which a programmer types in code that equals “give honest answers” or “distinguish cats from dogs” or “only take actions which are in accordance with human values.” There are simply tests which approximate those things, and systems which do better at passing the tests get iterated on and eventually released.
But this means that the inner structure of those systems—the rules they’re actually following, their hidden, opaque processes and algorithms—doesn’t actually match the intent of the programmers. I caution you not to take the following metaphor too far, because it can give you some false intuitions—
(Such as tricking you into thinking that the AI systems we’re talking about are necessarily self-aware and have conscious intent, which is not the case.)
—but it’s sort of like how parents and schools and churches and teams impose rules of behavior on their children, hoping that those rules will manage to convey some deeper underlying concepts like empathy or cooperation or piety…
…but in fact, many children simply work out how to appear compliant with the rules, on the surface, avoiding punishment while largely doing whatever they please when no one is looking, and developing their own internal value system that’s often unrelated to (or even directly contra) the values being imposed on them from without. And then, once the children leave the “training environment” of childhood, they go out into the world and express their true values, which are often startlingly different despite those children having passed every test and inspection with flying colors.
The point here is not “AI systems are being actively deceptive” so much as it is “there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind.” Any specific hoped-for goal or value is a very small target in a very large space, and there’s no extra magic that “helps” the system figure out what it’s “really” supposed to be doing. It’s not that the AI is trying to pass the test while actually being shaped rather differently than expected, it’s just that the only constraint on the AI’s shape is the tests. Everything else can mutate freely.
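(A toy sketch of the "the only constraint is the tests" point, in Python. Everything here is invented for illustration: the prompts, the three possible behaviors, and the "generate candidates at random and keep the ones that pass" loop, which is a crude stand-in for what is actually gradient descent rather than literal filtering.)

```python
import random

TEST_PROMPTS = ["t1", "t2", "t3"]      # the finite tests we happened to write
DEPLOY_PROMPTS = ["d1", "d2"]          # situations the world will actually present

def random_candidate(rng):
    # Stand-in for "an opaque system the training process spat out":
    # its behavior on every prompt is just whatever it happens to be.
    return {p: rng.choice(["desired", "weird", "bad"])
            for p in TEST_PROMPTS + DEPLOY_PROMPTS}

def passes_all_tests(candidate):
    # The only constraint we ever apply: behavior on the test prompts.
    return all(candidate[p] == "desired" for p in TEST_PROMPTS)

rng = random.Random(0)
survivors = [c for c in (random_candidate(rng) for _ in range(10_000))
             if passes_all_tests(c)]

print(len(survivors), "candidates pass every test")
print("distinct behaviors once deployed:",
      sorted({tuple(c[p] for p in DEPLOY_PROMPTS) for c in survivors}))
# Every survivor looks identical on the tests; off the tests, they're all over the map.
```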
Metaphorically, if the teal lines below represent the tests, then the developers probably were trying to grow something like the black shape:
…however, each of these black shapes is basically just as good at passing that particular test:
…despite the fact that those shapes have very different properties from each other, as well as from the intended shape, and will “behave” very differently in some contexts.
(“Ah, okay,” you might think. “The problem is that they left the teal lines open, meaning that there was room for the thing to ‘grow outward.’ We just need to specify a closed space.” And then (metaphorically) you draw a closed shape in the two-dimensional space of the screen, and the thing grows outward in the third dimension, or shatters fractally inward, or any number of other possibilities that we can’t confidently conclude we’ve definitely found and closed off.)
Another way to say this is that training and testing is meant to help us find and iterate on systems which share the terminal goals of the developers, but in practice, that sort of process can’t actually distinguish between [a system with a terminal goal of X] and [a system with a terminal goal of Y but a local, instrumental goal of X].
For every system that really actually “wants” to do X, there will be myriad similar systems that are oriented around Y or Z or ?? in a fundamental sense, but which happen to do X under the particular constraints of the training environment.
And since we can’t just look directly at the code and comprehend its structure and purpose, there’s no way for us to be confident that we know what the system is “really” trying to do except to deploy it and see what happens.
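(To make the [terminal X] vs [instrumental X] point concrete, here's a deliberately tiny sketch; the two "policies" and the x_leads_to_y flag are made up for the example. Note that in the toy we can read the source and see the difference at a glance, which is exactly the luxury we don't have with a real network.)

```python
def policy_terminal_x(env):
    # This one "wants" X for its own sake.
    return "do_x"

def policy_terminal_y(env):
    # This one "wants" Y; it does X only so long as X is the best available route to Y.
    return "do_x" if env["x_leads_to_y"] else "pursue_y_directly"

training_env   = {"x_leads_to_y": True}    # the only world we ever test in
deployment_env = {"x_leads_to_y": False}   # the world after release

for policy in (policy_terminal_x, policy_terminal_y):
    print(policy.__name__,
          "| training:", policy(training_env),
          "| deployed:", policy(deployment_env))
# Both behave identically in training; nothing in the training signal ever
# favored one over the other, and only deployment reveals the difference.
```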
(To be clear: these are not merely hypothetical concerns; we already have real-world examples of modern AI systems behaving one way under narrow constraints and very differently in the broader environment, or optimizing for a proxy at the expense of the intended goal. This is what one would expect, given the way these systems are developed—it’s basically a sped-up version of the selection pressures that drive biological evolution, which produced humans who optimize for orgasms more so than for reproduction and invented condoms as soon as they could.)
This is my answer to the common objection “But don’t these systems just do what we tell them to do?” We don’t, in fact, tell them. We select for them, which is not at all the same thing.
An objection I didn't have time for in the above piece is something like "but what about Occam, though, and k-complexity? Won't you most likely get the simple, boring, black shape, if you constrain it as in the above?"
To which my answer is "tiger stripes." You will indeed find the simple, boring black shape more easily in the two-dimensional plane of the screen. But the possibility space is vast, and there are many, many more dimensions in play. Imagine, if it helps, that instead of the middle-schooler-accessible version above, what I actually drew was a bunch of examples whose cross-sections are all the simple, boring black shape, and whose weirdnesses all lie in the unconstrained third dimension.
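(If it helps to see the "unconstrained dimension" picture in miniature: a toy sketch, not a claim about actual simplicity priors. The `deployed` flag below is an invented stand-in for any axis that never varies during training.)

```python
def model_a(x, deployed):
    # Simple everywhere.
    return 2 * x

def model_b(x, deployed):
    # Identical cross-section on the axis we measured; all of the weirdness
    # lives on an axis that was held constant throughout training.
    return 2 * x if not deployed else -2 * x

training_data = [(x, False) for x in range(10)]   # `deployed` never varies here
assert all(model_a(x, d) == model_b(x, d) for x, d in training_data)
print("identical on everything we ever checked; divergent once `deployed` flips")
```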
We can selectively "breed" AIs to behave a certain way, and succeed at that to a limited degree, but the AIs are way weirder than dogs or humans, and thus should be expected to at least have the capacity to behave "properly" in certain ways while actually being dangerously not-at-all the thing that behavior would imply if it were coming from an already-known creature whose properties are at least pinned down to a reasonably finite space. A gazelle would not predict that a tiger's fur is actually an insanely visible bright color, and a gazelle would be wrong.
EDIT: Perhaps a better intuition pump: We selectively bred bulldogs and pugs, but we couldn't get the bulldogs and pugs we wanted without also getting the breathing issues; imagine something like that, except with many more axes of surprising variance. Even selective breeding that starts with known, constrained critters and clearly defined targets doesn't actually go well, and this is neither constrained nor clearly targeted.
To be clear, it's not because we agree with Buck's model. It's more that Eliezer has persistent health and stamina issues and others (Nate, Malo, etc.) need to step up and receive the torch.
I personally would not recommend financial support of MIRI, because I'm worried it will amplify net negative communications from him
Small note: Eliezer is largely backing off from direct comms and most of our comms in the next year will be less Eliezer's-direct-words-being-promoted than in the past (as opposed to more). Obviously still lots of Eliezer thoughts and Eliezer perspectives and goals, but more filtered as opposed to less so. Just FYI.
It's going to depend a lot on the social bubble/which group of friends. It's not outrageous for the social circles I run in, which are pretty liberal/West Coast, but it would be outrageous for some bubbles I consider to be otherwise fun and fine and healthy.
Mainly it leans into the archetype of games like Truth or Dare, or Hot Seat, which are sort of canonically teenage party games and thus are often trying to loosen those particular strictures.
Copying over a comment from the EA forum (and my response) because it speaks to something that was in some earlier drafts, that I expect to come up, and that is worth just going ahead and addressing imo.
IMO it would help to see a concrete list of MIRI's outputs and budget for the last several years. My understanding is that MIRI has intentionally withheld most of its work from the public eye for fear of infohazards, which might be reasonable for soliciting funding from large private donors but seems like a poor strategy for raising substantial public money, both prudentially and epistemically.
If there are particular projects you think are too dangerous to describe, it would still help to give a sense of what the others were, a cost breakdown for those, anything you can say about the more dangerous ones (e.g. number of work hours that went into them, what class of project they were, whether they're still live, any downstream effect you can point to, and so on).
My response:
(Speaking in my capacity as someone who currently works for MIRI)
I think the degree to which we withheld work from the public for fear of accelerating progress toward ASI might be a little overrepresented in the above. We adopted a stance of closed-by-default research years ago for that reason, but that's not why e.g. we don't publish concrete and exhaustive lists of outputs and budget.
We do publish some lists of some outputs, and we do publish some degree of budgetary breakdowns, in some years.
But mainly, we think of ourselves as asking for money from only one of the two kinds of donors. MIRI feels that it's pretty important to maintain strategic and tactical flexibility, to be able to do a bunch of different weird things that we think each have a small chance of working out without exhaustive justification (or post-hoc litigation) of each one, and to avoid the trap of focusing only on clearly legible short chains of this→that (as opposed to trying both legible and less-legible things).
(A colleague of mine once joked that "wages are for people who can demonstrate the value of their labor within a single hour; I can't do that, which is why I'm on a salary." A similar principle applies here.)
In the past, funding MIRI led to outputs like our alignment research publications and the 2020/2021 research push (that didn't pan out). In the more recent past, funding MIRI has led to outputs like the work of our technical governance team, and the book (and its associated launch campaign and various public impacts).
That's enough for some donors—"If I fund these people, my money will go into various experiments that are all aimed at ameliorating existential risk from ASI, with a lean toward the sorts of things that no one else is trying, which means high variance and lots of stuff that doesn't pan out and the occasional home run."
Other donors are looking to more clearly purchase a specific known product, and those donors should rightly send fewer of their dollars to MIRI, because MIRI has never been and does not intend to ever be quite so clear and concrete and locked-in.
(One might ask "okay, well, why post on the EA forum, which is overwhelmingly populated by the other kind of donor, who wants to track the measurable effectiveness of their dollars?" and the answer is "mostly for the small number who are interested in MIRI-like efforts anyway, and also for historical reasons since the EA and rationality and AI safety communities share so much history." Definitely we do not feel entitled to anyone's dollars, and the hesitations of any donor who doesn't want to send their money toward MIRI-like efforts are valid.)
Most of the seven Extended Discussions under chapter 4 of Nate and Eliezer's book's supplementals are basically an expansion of this thesis (which I also agree with and think is true).