May I ask what your cold emails looked like? I'm always curious about what sort of prompts are more likely to get replies, and which are more likely to be left on read.
It was formatted based on typical academic "I am conducting a survey on X, $Y for Z time", and notably didn't mention AI safety. The intro was basically this:
My name is Vael Gates, and I’m a postdoctoral fellow at Stanford studying how productive and active AI researchers (based on submissions to major conferences) perceive AI and the future of the field. For example:
- What do you think are the largest benefits and risks of AI?
- If you could change your colleagues’ perception of AI, what attitudes/beliefs would you want them to have?
My response rate was generally very low, which biased the sample towards... friendly, sociable people who wanted to talk about their work and/or help out and/or wanted money, and had time. I think it was usually <5% response rate for the NeurIPS / ICML sample off the top of my head. I didn't A/B test the email. I also offered more money for this study than the main academic study, and expect I wouldn't have been able to talk to the individually-selected researchers without the money component.
Commentary on the cvgig transcript:
0:20:04.4 Vael: Yeah, it seems right. I would kind of expect that if people really got stuck, they would start pouring effort into interpretability work for other types of things.
I think I don't have this impression? I think when something 'gets stuck', progress normally comes from one of the side branches that turns out not to have the problem that the main branch has, rather than people really doubling down on understanding the main branch and figuring out how far back they have to roll back / what they have to be doing instead.
Like, when I think about Chess and Go bots, there's some stuff you get from 'better transparency' on the old paradigms, but I'm guessing it mostly wouldn't have given you the advances that we actually got. My impression of things in stats/ml more broadly is that we have lots of "Bayesian understandings" of techniques which point out the optimal way to do a technique, and something like 2/3rds of the time they're "here's the story behind the optimality of the technique someone found by futzing around" and only about 1/3rd of the time they're "here's a new technique that we found by thinking about what techniques are good." Somehow this paper comes to mind.
Commentary on the 7ujun transcript:
I think that's more of a sociological question than it is a technical question. The class of problems and the class of algorithms that are considered AI has changed dramatically over the past 50 years
This is actually a pretty good point: when we talk about "AI" instead of "software" or "machines" or whatever, a lot of people think about this narrow thing, which is something like "hypothetical programs on or beyond the edge of scientific / technological development." And so anyone who can tell you where the edge will be in 50 years is obviously overconfident about their predictive ability!
0:34:48.2 Vael: ... Well, it seems like you could train it on one personality if you wanted to, right? If you had enough data for that, which we don't. But if we did. And then I wouldn't really worry about it having different agents in it.
0:35:17.6 Interviewee: That's a very, very, very, very, very, very, very, very large amount of text.
0:35:26.5 Interviewee: Do you any-- do you have any scope of understanding for how much text that is?
It seems worth pointing out that my first question was "wait, is it that large?" and then they do the math and it does seem like that many 'very's is justified.
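For a rough feel of the scale involved, here is a back-of-the-envelope sketch. It is not the calculation from the transcript: the words-per-day figure, the career length, and the words-to-tokens ratio are all assumptions I picked for illustration, and the only outside number is the roughly 300 billion training tokens reported for GPT-3.

```python
# Back-of-the-envelope: how many "writing lifetimes" of text match a GPT-3-scale
# training run? Every constant except TRAINING_TOKENS is an illustrative assumption.

WORDS_PER_DAY = 2_000      # assumed: daily output of a very prolific writer
YEARS_WRITING = 60         # assumed: length of a writing career
TOKENS_PER_WORD = 1.3      # assumed: rough English words-to-tokens ratio
TRAINING_TOKENS = 300e9    # GPT-3 was reported as trained on roughly 300B tokens


def lifetime_tokens(words_per_day: float, years: float, tokens_per_word: float) -> float:
    """Total tokens one person writes over a career, under the assumptions above."""
    return words_per_day * 365 * years * tokens_per_word


def lifetimes_needed(training_tokens: float, tokens_per_lifetime: float) -> float:
    """How many such careers it would take to match the training set size."""
    return training_tokens / tokens_per_lifetime


one_life = lifetime_tokens(WORDS_PER_DAY, YEARS_WRITING, TOKENS_PER_WORD)
ratio = lifetimes_needed(TRAINING_TOKENS, one_life)

print(f"tokens per lifetime: {one_life:,.0f}")
print(f"lifetimes needed:    {ratio:,.0f}")
```

Even with generous assumptions about how much one person writes, the gap is several orders of magnitude, which is the sense in which the string of 'very's seems earned.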
But some followups:
So a pressingly important question is, to what extent does this interfere with... Let's, to make language easier, call it one of its personalities. Let's say one of its personalities wants to do something in the world: kill all the humans or even something mundane. To what extent does the fact that it's not the only personality interfere with its ability to create and execute plans?
This whole section (I'm quoting a typical passage near the start) seems pretty good. Basically, the Interviewee is reacting to people worried that GPT-3 is going to 'wake up' a la Skynet with: "have you seen how crazily incoherent this thing is?"
Somehow this section reminded me of some recent Eliezer stuff, which I'll summarize as saying that he's pretty confident in his models of what AI will look like eventually, and not what it'll look like soon; as you rise thru the atmosphere, there's a bunch of turbulence, and only after you get thru that is the flight path smooth and predictable.
In this situation, GPT-3 seems like it has lots of personalities and switches between them in an undeliberate way; but presumably the CEO-bot is trying to be coherent, instead of trying to be able to plausibly generate any of the text that it saw on the internet. [The interviewee notes that this incoherency causes capability problems; presumably systems will become more coherent as part of the standard push towards capability!]
[I should note that I think this focuses too much on the human-visible 'incoherence' to be really reassuring that there's not some sort of coherency being trained where we haven't figured out where to look for it yet.]
To clarify a little, Vael initially suggests that you could train GPT-3 from scratch on one human's output to get a safe imitation of one specific agent (that human), without any further weirdness. This does seem obviously wrong: there is probably more than enough information in that output to recover the human's personality etc, but one human's lifetime output of text clearly does not encode everything they have learned about the world and is radically inadequate. Sample-efficiency of DL is beside the point; the data just is not there. I have learned far more about, say, George Washington than I have ever written down (because why would I do that?) and no model trained from scratch on my writings will know as much about George Washington as I do.
However, this argument is narrow and only works for text and text outputs. Text outputs may be inadequate, but those two words immediately suggest a response: What about my text inputs? Or inputs beyond text? Multimodal models, which are already enjoying much success now, are the most obvious loophole here. Obviously, if you had all of one specific human's inputs, you do have enough data to train an agent 'from scratch', because that is how that human grew up to be an agent! It is certainly not absurd to imagine recording a lifetime of video footage and body motion, and there are already the occasional fun lifelogging projects which try to do similar things, such as to study child development. Since everything I learned about George Washington I learned from using my eyes to look at things or my ears to listen to people, video footage of that would indeed enable a model to know as much about George Washington as I do.
Unfortunately, that move immediately brings back all of the safety questions: you are now training the model on all of the inputs that human has been exposed to throughout their lifetime, including all the Internet text written by other people. All of these inputs are going to be modeled by the human, and by the model you are training on those inputs. So the 'multiple personality' issue comes right back. In humans, we typically have a strong enough sense of identity that words spoken by other humans can't erase or hijack our identity... typically. (Your local Jesus or Napoleon might beg to differ.) With models, there's no particular constraint by default from a giant pile of parameters learning to predict. If you want that robustness, you're going to have to figure out how to engineer it.
I disagree with their characterization of DRL, which is highly pessimistic and in parts factually wrong (eg. I've seen plenty of robot policy transfer).
I agree with them about thinking of GPT-3 as an ensemble of agents randomly sampled from the Internet, but I think they are mostly wrong about how hard coherency/consistency is or how necessary it is; it doesn't strike me as all that hard or important, much less as the most critical and important limitation of all.
Of course, starting with an empty prompt will produce incoherent gibberish, since you're unlikely to sample the same agent twice, but the prompt can easily keep identity on track, and if you pick the same agent, it can be coherent. (Or at least, it seems to me that the interviewee is relying heavily on the 'crowdsourced distribution' being incoherent with itself - which of course it is, just as humanity as a whole is deeply incoherent - but punting on the real question, which is whether any agent GPT-3 can execute can be coherent; the answer is either yes, or it seems like further scaling/improvements would render it more coherent.) GPT-3 is inferring a latent variable in the POMDP of modeling people; it doesn't take much evidence to update the variable to high confidence. (Proof: type in "Hitler says" - or "Gwern says*" - to your friendly local GPT-3-scale model. That's all of... <40 bits? or like 4 tokens.) The more history it has, the more it is conditioning on in inferring which MDP it is in and who it is. This prompt or history could be hardwired too; note that it doesn't even have to be text: look at all the research on developing continuous prompts or very lightweight finetuning.
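To make the "not much evidence" claim concrete, here is a toy sketch of Bayesian inference over a discrete latent 'persona' variable. It is only an illustration of the arithmetic, not a claim about what GPT-3 computes internally; the persona names, the likelihoods, and the one-in-a-billion prior are all made-up assumptions.

```python
import math


def posterior(prior: dict, likelihoods: dict) -> dict:
    """Bayes' rule over a discrete latent variable.

    prior:       persona -> P(persona)
    likelihoods: persona -> P(observed prompt | persona)
    """
    unnormalized = {k: prior[k] * likelihoods[k] for k in prior}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}


def bits_to_reach(prior_p: float, target_p: float) -> float:
    """Bits of evidence (log2 likelihood ratio) needed to move the
    probability of one hypothesis from prior_p to target_p."""
    prior_odds = prior_p / (1 - prior_p)
    target_odds = target_p / (1 - target_p)
    return math.log2(target_odds / prior_odds)


# Hypothetical three-persona world: a name in the prompt is far more likely
# under the matching persona than under the others (numbers are assumptions).
prior = {"persona_a": 1 / 3, "persona_b": 1 / 3, "persona_c": 1 / 3}
likelihoods = {"persona_a": 0.90, "persona_b": 0.05, "persona_c": 0.05}
post = posterior(prior, likelihoods)
print(post)

# Assumed prior of one-in-a-billion for a specific author, target 99% confidence.
bits = bits_to_reach(1e-9, 0.99)
print(f"bits of evidence needed: {bits:.1f}")
```

Under these assumptions, picking one author out of a billion candidates at 99% confidence takes fewer than 40 bits, which is in the same ballpark as the handful of tokens in a name-plus-"says" prompt.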
Also, consistency of agent may be overrated given how extremely convergent goals are given circumstances (Veedrac takes this further than my Clippy story). The real world involves a few major decisions which determine much of your life, filled in by obvious decisions implementing the big ones or which are essentially irrelevant like deciding what movie to watch to unwind tonight; the big ones, being so few, are easy to make coherent, and the little ones are either so obvious that agents would have to be extremely incoherent to diverge or it doesn't matter. If you wake up in a CEO's chair and can do anything, nevertheless, the most valuable thing is probably going to involve playing your role as CEO and dealing with the problem your subordinate just brought you; the decision that mattered, where agents might disagree, was the one that made you CEO in the first place, but now that is a done deal. Or more concretely: I can, for example, predict with great confidence that the majority of humans on earth, if they were somehow teleported into my head right now like some body-swap Saturday morning cartoon, would shortly head to the bathroom, and this is the result of a decision I made several hours ago involving tea; and I can predict with even greater confidence that they will do so by standing up and walking into the hallway and walking through the doorway (as opposed to all the other actions they could have taken, like wriggling over the floor like a snake or trying to thrust their head through the wall). A GPT-3 which is instantiating a 'cloud' of agents around a prompt may be externally indistinguishable from a 'single' human-like agent (let's pretend that humans are the same exact agent every day and are never inconsistent...), because they all want fairly similar things and when instantiated, all wind up making pretty much the same exact choices, with the variations and inconsistencies being on minor things like what to have for breakfast.
(It's too bad Vael didn't ask what those suggested experiments were; it might've shed a lot of light on what the interviewee thinks. We might not disagree as much as I think we do.)
* if you were wondering, first completion in playground goes
Gwern says: I found [this](https://www.reddit.com/r/rational/comments/b0vu8z/the_sequences_an_evergrowing_rationalist_bible/) on reddit. It's a collection of sequences, which are basically essays written by Scott Alexander. He's a psychiatrist who writes a lot about rationality, and these sequences are basically his attempt to explain the basics of rationality to people. I found them really helpful, and I thought other people might find them helpful too.
It may not have located me in an extremely precise way, but how hard do you think agents sampled from this learned distribution of pseudo-gwerns would find it to coordinate with each other or to continue projects? Probably not very. And to the extent they don't, why couldn't that be narrowed down by a history of 'gwern' declaring what his plans are and what he is working on at that moment, which each instantiated agent will condition on when it wakes up and try to predict what 'gwern' would do in such a situation?
I've been finding "A Bird's Eye View of the ML Field [Pragmatic AI Safety #2]" to have a lot of content that would likely be interesting to the audience reading these transcripts. For example, the incentives section rhymes with the type of things interviewees would sometimes say. I think the post generally captures and analyzes a lot of the flavor / contextualizes what it was like to talk to researchers.
Commentary on the a0nfw transcript:
I ended up not having anything interesting to say about this one.
Two authors gave me permission to publish their transcripts non-anonymously! Thus:
Interview with Michael L. Littman (https://docs.google.com/document/d/1GoSIdQjYh21J1lFAiSREBNpRZjhAR2j1oI3vuTzIgrI/edit?usp=sharing)
Interview with David Duvenaud (https://docs.google.com/document/d/1lulnRCwMBkwD9fUL_QgyHM4mzy0al33L2s7eq_dpEP8/edit?usp=sharing)
These experts give a lot of good reasons why real AI is a very long way off, but it makes you wonder whether deep learning could unexpectedly lead to something profoundly different from awareness yet just as powerful: instead of an organism trying to survive in the world, a series of profound terminal insights.
tldr: I conducted a series of interviews with 11 AI researchers to discuss AI safety, which are located here: TRANSCRIPTION LINK. If you are interested in doing outreach with AI researchers, I highly recommend taking a look!
[Cross-posted to the EA Forum.]
Overview
I recently conducted a series of interviews with 11 AI researchers, wherein I laid out some reasons to be concerned about long-term risks from AI.
These semi-structured interviews were 40-60 minutes long and conducted on Zoom. Interviewees were cold-emailed, were paid for their participation, and agreed that I may share their anonymized transcripts.
Six of the interviews were with researchers who had papers accepted at NeurIPS or ICML in 2021. Five of the interviews were with researchers who were informally categorized as “particularly useful to talk to about their opinions about safety” (generally more senior researchers at specific organizations).
I’m attaching the raw transcripts from these 11 interviews, at the following link. I’ve also included the approximate script I was following, post-interview resources I sent to interviewees, and informal interview notes in the associated “README” doc. Ideally I’d have some analysis too, and hopefully will in the future. However, I think it’s useful, particularly for people who plan to start similar projects, to read through a couple of these interviews, to get an intuitive feel for what conversations with established AI researchers can feel like.
Note: I also interviewed 86 researchers for a more complete academic, under-IRB study (whose transcripts won’t be released publicly), whose results will be posted about separately on LessWrong once I finish analyzing the data. There will be substantially more analysis and details in that release; this is just to get some transcripts out quickly. As such, I won't be replying to a lot of requests for details here.
Thanks to Sam Huang, Angelica Belo, and Kitt Morjanova, who helped clean up the transcripts!
Personal notes
TRANSCRIPTION LINK