Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas

Orpheus16

TLDR: I interviewed Shoshannah Tekofsky. You can listen to our conversation here or read some highlights below.

Some people in AI safety think that it’s important to have people coming up with fresh, new ideas. There’s a common meme that “thinking for yourself” and “developing your inside view” can be important. But these concepts can feel vague. What does it really mean to “think for yourself” when approaching AI alignment research?

I recently interviewed Shoshannah Tekofsky. For better or for worse, Shoshannah embodies the “think for herself” spirit. After reading List of Lethalities, she started reading about AI alignment (part-time). She then wrote Naive Hypotheses on AI alignment, in which she proposes her own “naive ideas” to solve the alignment problem.

In the interview, we discuss Shoshannah’s journey, her approach to starting off in AI alignment research, some stories from her recent trip to Berkeley, and some of the ways she’s currently thinking about alignment research.

You can listen to the full interview here. I’ve included some highlights below.

Note: I am considering having more conversations about AI and starting a podcast. Feedback is welcome. Also feel free to reach out if you’re interested in helping (e.g., audio editing, transcript editing, generating questions).

Introduction

Akash: Shoshannah is 36 years old. She has a master's in computer science and ancient technology. And she has a PhD in Player Modeling in video games. Three months ago, Shoshannah started skilling up in AI safety. I read a LessWrong post by her called Naive hypotheses on AI alignment. She's spent the last couple of months continuing to skill up in AI safety, including spending one month in Berkeley, California. I'm excited to talk to Shoshannah about how her skilling up in AI safety has been going, how her experience in Berkeley was, and a bunch of other stuff that will inevitably come up.

Bad advice

Shoshannah: But the advice was absolutely terrible. And I think this is a common way of thinking. He said, “you need to spend about one or two years reading all the existing literature, following up on all the current authorities, and working through that.” And then at the end of that, you might be able to add on to some of the existing frames, or once you've like, written one thing after like two years about, you know, what you think about the existing frames, or how they can be expanded, then, you know, you can start thinking about your own stuff. And this was like, literally his advice. I was like, that is absolutely backward from what I think anybody should be doing right now.

On exploring new and “dumb” ideas

Shoshannah: I think people tend to assume areas are unexplored, because you need to have, you know, an IQ of like 200 or whatever the hell is impossible [in order to explore those ideas]. But actually, if you look at scientific history, that's maybe true for some ideas, like definitely true for some selection of ideas that are just really complicated. But a lot of ideas, especially when a field is very young, tend to be, well, everybody just thought somebody else was doing it. That's one area, because it's just so obvious.

Also, there might be a narrative in the community that this is not a smart idea. And nobody is willing to go against that and explore it seriously.

And then my last thought is, there's a whole area of dumb ideas at first sight-- ideas that just seem ridiculous or weird. And so you don't explore that. From my point of view, the people who take the responsibility and say, “I'm able and interested to work on this,” sort of have the responsibility to explore the part of the solution space that they can explore. And not worry so much about their social status.

Your stupid idea might inspire somebody else to have a smart idea. That's the whole idea of the little sparring thing. And there are examples throughout history where people couldn't predict the effects that they themselves would have, or didn't realize that they had a big breakthrough, when actually it was a really big breakthrough. So there's this internal step of exploring new ideas. And to trust that you have the intelligence presumably to come up with good ideas. And to trust that even with what starts out sounding like a stupid idea, you can explore that or discuss that with other people. And just view that as something separate from your own ego. Being like, hey, we covered this part of the solution space, we talked about that, and we took it seriously.

Imaginary numbers

Shoshannah: I watched this video series on imaginary numbers just around the time that I learned about the AI alignment problem. And one thing that kind of struck me when I was watching this video series, was it described the history of imaginary numbers being discovered. And there was this part of the story where they described how counterintuitive the entire concept was.

And then it was so counterintuitive, that at least the first person who actually figured it out was like, “oh, this is such a ridiculous and dumb idea”, and he never shared it or published it.

And like after he died, they found the notes years later. I think it was like 100 or 200 years later. And people were like “oh, he was totally right, but he just thought it was stupid.”

Because it was just such a crazy idea. I'm not trying to validate that all crazy ideas are like that. But currently, we're in a position where we could definitely do with more crazy ideas and like actually share them. And it's easier for the community to filter your crazy ideas for new inspiration than to necessarily generate more inspiration.

Berkeley, community, & the Indiana Jones/James Bond approach to alignment

Shoshannah: I was there in the summer. And I tried to meet as many people as I could and attend as many events as I could. And it was really intense. So people work really hard. People are very passionate, they're very deeply into everything they're doing. And the environment is really dynamic.

There's kind of like the Indiana Jones/James Bond approach to solving alignment. People are like, “Okay, if it helps to go skydiving tomorrow to solve alignment, then that's what we're doing. And like, if we don't have a plane, we'll, like, build our own plane and use it to go skydiving tomorrow.” Obviously, that's not what we're doing, because it doesn't solve alignment.

But it kind of had that feel of, yeah, if we have this good idea, let's try it. Let's do it right now. Bring everybody together. And then it just happens the next week.

There’s no pressure to show up for a certain thing, nothing like that. But there's all this stuff happening, where, you know, people have choir and they have dinner, and then they have a picnic. And then you can do sports, or you can do applied rationality things and like, there's this whole community vibe. If you need anything, somebody can help you. You help each other. And there's like this whole experience that's completely opposite to like, the whole individualization thing. People becoming more lonely in the West, or at least in Europe, that's a big thing.

I instantly felt like I belonged. And that was kind of like an interesting experience. And it's obviously, you know, due to having a lot of overlap in social norms, epistemic norms and comfort. And that combined with, you know, the James Bond/Indiana Jones approach to solving AI, like, AI alignment, was just a really interesting experience.

Ideas Shoshannah has been thinking about

Shoshannah: So I'm taking this really top-down approach now. And so instead of like, following my curiosity down all the different paths, I'm trying to really start at the top of the problem, and then work my way down, and find the paths I want to walk. And so the first question is, what is the alignment problem? And I'm trying to build my own model of it. I know the whole like, inner outer alignment model with the mesa-optimizers, and all that sort of stuff. And instead, I'm trying to really understand: what is the problem we're trying to solve? How do all the things that I have read now actually fit into that? And I try to develop my own frame.

And then from that frame, I'm trying to figure out which problems we need to solve and how we might do that. One of the obvious ones that drew my attention is the goodharting issue, where any signal can be gamed, and then the signal, the proxy, gets optimized instead of the actual thing. And so one of the ideas I had pretty early on is this question about what are we even trying to align for? So what do we even want the AI to do?

Because I was wondering if, like, most of the stuff I was reading was solving alignment in sort of a generic sense, instead of a specific way. And there's, to me, it seems logical that you only need to solve for the specific goals that you want an AGI to have, and not for all possible goals we could ever give it. Because we don't actually want it to be a paperclip factory. So like, why are we optimizing on that? Or like, why are we trying to solve for that and like, that's not really an issue, but we can constrain the solution space.

On intuitions

Akash: I hear this a lot, too. I think especially when people are describing, like, how Paul and Eliezer are disagreeing. One of the common things I hear is “oh, you know, they just have, like, different intuitions about how relevant evolution is or how relevant the history of scientific breakthroughs is”. And it's like, “Ah, yes, yes, the smart people just have different intuitions.” So that's a legitimate reason for them to disagree. Why is it a pet peeve of yours [when people refer to intuitions]?

Shoshannah: I think intuition is like your feelings in a lab coat. Like, they're still like your feelings, and I get it. So intuitions are a valid start of a direction you might want to look in. But intuitions need to either be formed into a hypothesis, or they need to be grounded in reasoning. And if you can't do one or the other, then they don't have much place in the discussion about empirical science. Because everybody, everybody's intuitions are equal, because they're your feelings about what might be right. That's all they are. I'm not saying they're not useful, but it comes up a lot and it gets taken seriously as if it's a testable hypothesis or as if it is actual reasoning. But really, you've just bottomed out to “we agree to disagree.” Because feelings.

On having kids

Shoshannah: Yeah, I mean, I noticed that in the community, almost no one has kids. So that's like, definitely a very different world.

My kids are my main motivations. So you know, that kind of like, makes it sort of a unique position. I noticed people are extremely supportive about it, and have very positive responses, which I really appreciate.

I was a stay-at-home mom for four years, which definitely like shapes your brain a little bit. Like you become pregnant and you have the child and like you have the postpartum stuff and like breastfeeding; really intense. And seriously, man, I have not run into a job that's half as difficult.

And so in comparison working on alignment at the pace that I am now-- it's pretty much every day that I can possibly work on alignment, like every minute I can possibly work on alignment I do-- it's still a less intense schedule than having kids. You know, kids will wake you up in the middle of the night. Like, you might not always feel like cleaning a diaper and like your thoughts get interrupted. Your needs get interrupted on such a visceral level.

While doing research, it’s the opposite. Like you're trying to optimize your own state and your own mind. So you can have like these really deep thoughts and you can push yourself but there's like this sense of control over when you push yourself and like you do it because there's like a reward signal at the end. So to me, yeah, going through motherhood first kind of makes dedicating myself more to research actually kind of easy. It's like a cakewalk in comparison.

On what the alignment problem is & why it might be hard

Shoshannah: Yeah, I think the way I see the challenge right now of doing that, is for one, we don't know what to point it at. That's sort of like this question of like, are there universal human values? Or what goals do we want it to achieve? Do we want it to launch us into space? Do we want it to make sure that humanity is flourishing? What does it mean to be flourishing?

There's this problem of how do we communicate our goals to the AI? And how do we make sure that it doesn't misunderstand those goals?

And then there's the second problem of once you've decided what you want it to do, so say, you want it to, you know, put a rocket on the moon or on Mars or whatever, then there's this second question of how do you make it like not fool you into believing that this thing has been achieved? And I kind of call this a question about how you make it attached to the actual goal.

So you think you're training it in safe environments, simulators, or otherwise sanitized environments, and then you let it out in the world. But now it has so much power, it could actually hurt us. But it was never trained in an environment in which it could hurt us. So how are you sure it's going to do the thing you expect it to do? How are you then sure that whatever it seems to have learned, is actually the thing that it has learned? So that it isn't like deceiving you.

So like, these are these questions about actual misalignment and deception, where an AI during training, if it wants to, for instance, maximize its reward, can do so by pretending during training that it has the intended goals. And then as soon as it's out in the world, it switches to a policy where it actually does the thing it really wants, which might involve hurting us. So you have these sorts of deception patterns.

And then the last problem we have is we can't currently control when AGI is developed, because there's sort of like this race to the bottom. So like, we might not have actually solved alignment by the time AGI shows up, because we don't have enough coordination going on in the world to make sure that we wait that long. Because you know, there are companies that are making AGI that are sensitive to alignment concerns. But if you convince them not to do it, then the companies that are not sensitive to alignment concerns will be the ones to do it.

And if you somehow manage to stop those, then really bad actors or criminals or governments that, you know, don't have the most prosocial intentions for the rest of the world will develop them instead. So you have all these different stages of problems. And then it gets worse. Because as soon as you've solved all those things, and you think you've done alignment perfectly, and you have like your proof, and it's all great, then, once the AI develops and like takes off and becomes superhuman in its like general intelligence, only then do you know if you made it or not, and you basically only have one shot. And you either did or you didn't because you can't go back. So that's an issue.
