From the 80,000 Hours Podcast (1:10:38):
Robert Wiblin: A few weeks ago, we published our conversation with Pushmeet Kohli, who’s an AI robustness and reliability researcher at DeepMind over in London. To heavily summarize Pushmeet’s views, I think he might have made a couple of key claims.
One was that alignment and robustness issues, in his view, appear everywhere throughout the development of machine learning systems, so they require some degree of attention from everyone who’s working in the field. According to Pushmeet, this makes the distinction between safety research and non-safety research somewhat vague and blurry, and he thinks people who are working on capabilities are kind of also helping with safety, and improving reliability also improves capabilities, because you can then actually design algorithms that do what you want.
Secondly, I think he thought that an important part of reliability and robustness is going to be trying to faithfully communicate our desires to machine learning algorithms, and that this is kind of analogous to – although a harder instance of – the challenge of just communicating with other people, getting them to really understand what we mean. Although of course it’s easier to do that with other humans than with animals or machine learning algorithms.
A third point was, I guess, just a general sense of optimism: that DeepMind is working on this issue quite a lot and is keen to hire more people to work on these problems, and I guess a sense that probably we’re going to be able to gradually fix these problems with AI alignment as we go along and machine learning algorithms get more influential.
I know you haven’t had a chance to listen to the whole interview, but you skimmed over the transcript. Firstly, where do you think Pushmeet is getting things right? Where do you agree?
Paul Christiano: So I certainly agree that there’s this tight linkage between getting AI systems to do what you want and making them more capable. I agree with the basic optimism that people will need to address the “getting AI systems to do what we want” problem. I think it is more likely than not that people will have a good solution to that problem.
Maybe there’s this interesting intervention of, “Should longtermists be thinking about that problem in order to increase the probability?” I think even absent the actions of the longtermists, there’s a reasonably good chance that everything would just be totally fine. So in that sense, I’m on board with those claims, definitely.
I think that I would disagree a little bit in thinking that there is a meaningful distinction between activities whose main effect is to change the date by which various things become possible, and activities whose main effect is to change the trajectory of development.
I think that’s the main distinguishing feature of “working on alignment” per se. You care about this differential progress towards being able to build systems to do what we want. I think in that perspective, it is the case that the average contribution of AI work is almost by definition zero on that front, because if you just increased all the AI work by a unit, you’re just bringing everything forward by one unit.
And so I think that does mean there’s this well-defined thing which is, “Can we change the trajectory in a way?” And that’s an important problem to think about.
I think there’s also a really important distinction between the kind of failure which is most likely to disrupt the long-term trajectory of civilization, and the kind of failure which is most likely to be an immediate deal-breaker for systems actually being useful or producing money. And maybe one way to get at that distinction is related to the second point you mentioned.
Communicating your goals to an ML system is very similar to communicating with a human. I think there is a hard problem of communicating your goals to an ML system, which we could view as a capabilities problem. Are they able to understand things people say? Are they able to form the kind of internal model that would let them understand what I want? In some sense, it’s very similar to the problem of predicting what Paul would do, or it’s a little slice of that problem – predicting under what conditions Paul would be happy with what you’ve done.
That’s most of what we’re dealing with when we’re communicating with someone. If I’m talking with you, I would be completely happy if I just managed to give you a perfect model of me – then the problem is solved. I think that’s a really important AI difficulty for making AI systems actually useful.
I think that’s less the kind of thing that could end up pushing us in a bad long-run direction, mostly because we’re concerned about behavior as AI systems become very capable and have a very good understanding of the world around them, of the people they’re interacting with; and the really concerning cases are ones where AI systems actually understand quite well what people would do under various conditions, understand quite well what they want – what we think about as normal communication problems between people – that understand what Paul wants, but aren’t trying to help Paul get what he wants. And I think that a lot of the interesting difficulty, especially from a very long-run perspective, is really making sure that no gap opens up there.
Again, there's a gap between the problems that are most important on the very long-run perspective and the problems that people will most be confronting in order to make AI systems economically valuable.
I do think that there’s a lot of overlap – problems that people are working on, that make AI systems more valuable, are also helping very directly with the long-run outcome. But I think if you’re interested in differentially changing the trajectory or improving the probability that things go well over the long term, you’re more inclined to focus precisely on those problems which won’t be essential for making AI systems economically useful in the short term. And I think that’s really distinctive to what your motivation is or how you’re picking problems or prioritizing problems.
Robert Wiblin: One of the bottom lines for Pushmeet, I guess, was that people who want to make sure that AI goes well, they needn’t be especially fussy about whether they’re working on something that’s safety-specific or on something that’s just about building a new product that works well using machine learning.
Sounds like you’re a little bit more skeptical of that, or you think ideally people should in the medium term be aiming to work on things that seem like they disproportionately push on robustness and reliability?
Paul Christiano: Yeah, I think people who are mostly concerned about the long-term trajectory face this dilemma in every domain. If you live in the world where you think that almost all of the most serious challenges to humanity are caused by things humans are doing – by things not only that humans are doing, but by things humans are doing that we would often think of as part of productive progress, part of the goal – like we’re building new technologies, but those technologies are also the things that pose the main risks – then you have to be picky if you’re a person who wants to change the long-term trajectory.
You're just sort of like, "I probably am helping address those problems if I just go do a random thing – I work on a random project, make a random product better. I am helping address the kinds of problems we're concerned about. But I’m also at the same time contributing to bringing those problems closer to us in time."
And it’s sort of roughly a wash, if you’re on the average product, making the average product work. And there are subtle distinctions we could make, of like – I think if you are motivated to make products work well, if you’re like, “Not only do I want to do the thing that’s most economically valuable, I want to have more of an emphasis on making this product robust,” I think you’re just generally going to make a bunch of low-level decisions that will be helpful.
I definitely think you can have a pretty big impact by being fussy about which problems you work on.
Robert Wiblin: I guess there’s this open question of whether we should be happy if AI progress across the board just goes faster. What if, yeah, we can just speed up the whole thing by 20%? Both all of the safety and capabilities. As far as I understand, there’s kind of no consensus on this. People vary quite a bit on how pleased they’d be to see everything speed up in proportion.
Paul Christiano: Yeah, I think that’s right. I think my take, which is a reasonably common take, is it doesn’t matter that much from an alignment perspective. Mostly, it will just accelerate the time at which everything happens.
And there are some second-order terms that are really hard to reason about, like, “How good is it to have more or less computing hardware available?” Or “How good is it for there to be more or less kinds of other political change happening in the world prior to the development of powerful AI systems?”
There’s these kind of higher-order questions where people are very uncertain of whether it’s good or bad. But I guess my take would be that the net effect there is kind of small, and the main thing is I think accelerating AI matters much more on the next-100-years perspective.
If you care about welfare of people and animals over the next 100 years, then acceleration of AI looks reasonably good. So I think the main upside of faster AI progress is that people are going to be happy over the short term. I think if you care about the long term, it is roughly a wash, and people could debate whether it’s slightly positive or slightly negative but mostly it’s just accelerating where we’re going.
Robert Wiblin: Yeah, this has been one of the trickier questions that we’ve tried to answer in terms of giving people concrete career advice.
It seems to me if you’re someone who has done a PhD in ML or is very good at ML, but you currently can’t get a position that seems especially safety-focused or that is going to disproportionately affect safety more than capabilities, it is probably still good to take a job that just advances AI in general, mostly because you’ll be reaching the cutting edge potentially of what’s going on and improving your career capital a lot and having relevant understanding.
And the work, I guess you kind of think, is kind of close to a wash. It speeds things up a little bit – like, everything goes in proportion. It’s not clear whether that’s good or bad. But then you can potentially later on go and work on something that’s more alignment-specific, and that is the dominant term in the equation. Does that seem reasonable?
Paul Christiano: Yeah, I think that seems basically right to me. I think there’s some intuitive hesitation with the family of advice that’s like, “You should do this thing, which we think is roughly a wash on your values now, but there will be some opportunity in the future where you can sort of make a call.” I think there’s some intuitive hesitation about that, but I think that is roughly right.
Imagine if you offered to Paul two possible worlds. In one, there’s twice as many people working on machine learning and AI, but half of them really care about the long term and ensuring that AI is developed in a way that’s good for humanity’s long term. I'm like, "That sounds like a good trade."
We maybe then have less opportunity to do work right now. I think that’s the main negative thing. There will be less time to think about the alignment problem per se. But on the other hand, it just seems really good if a large fraction of the field really cares about making things go well. I just expect a field that has that character to be much more likely to handle issues in a way that’s good for the long term.
And I think you can sort of scale that down. It’s easiest for me to imagine the case where a significant fraction of the field is like that, but I think that if anything, the marginal people at the beginning are having probably a better cost-benefit analysis for them.
Robert Wiblin: Yeah, I guess I was suggesting that this would be the thing to do if you couldn’t get a job that was alignment-specific already. Say that they want to join your team but they’re just not quite good enough yet, they need to learn more. Or potentially, there’s just only so fast that the team can grow, so even though they’re good, you just can’t hire as quickly as people are coming on board.
But I suppose you have to make sure that, yeah, if people are going into these roles that we currently think are kind of just neutral, but good for improving their skills, that they don’t forget about that. That the original plan was at some point to switch to something different.
I guess there is a bit of a trap. It seems like people just in general tend to get stuck in doing what they’re doing now and convince themselves that whatever they’re doing is actually really useful. So you might think, “Yeah, it would be good to go in and then switch out,” but you might have some doubts about whether in fact you will follow through on that.
Paul Christiano: Yeah, I think that’s right. I would be even happier, certainly, in the world where you took those half of people who might have gone into ML, and you instead moved them all into really thinking deeply about the long term and how to make things go well. That sounds like an even better world still.
If someone really cared about the long term and were like, “What should I do,” it’s a reasonably good option to just be like, “Go do this thing which is good on the short term and adjacent to an area we think is going to be really important over the long term.”
Robert Wiblin: There’s been this argument over the years that it would just be good in some way that we can’t yet anticipate to have people at the cutting edge of machine learning research who are concerned about the long term and alert to safety issues and alert to alignment issues that could have effects on the very long term. And people have gone back and forth on how useful that actually would be, to just be in the room where decisions are getting made.
It's kind of occurred to me that it seems like the machine learning community is really moving in the direction of sharing the views that you and I hold. A lot of people are just becoming concerned about “Will AI be aligned in the long term?” And it might be that if you’re particularly concerned about that now, then maybe that makes you different from your peers right now, but in 10 years’ time or 20 years’ time everyone will have converged on a similar vision as we have a better idea of what machine learning actually looks like and what the risks are when it’s deployed.
Paul Christiano: Yeah, I think that’s an interesting question, or an interesting possible concern with that kind of approach. I guess my take would be that there are some differences – I don’t know if you’d call them values differences or deep empirical or worldview differences – that are relevant here. Where I think to the extent that we’re currently thinking about problems that are going to become real problems, it’s going to be much, much more obvious that there are real problems.
And I think that to the extent that some of the problems we think about over the very long term are already obviously problems, people in the ML community are very interested in problems that are obviously problems – or, problems that are affecting the behavior of systems today.
Like, again, if these problems are real, that’s going to become more and more the case over time, and so people will become more and more interested in those problems.
I still think there is this question of, "How much are you interested in making the long term go well, versus how much are you doing your job or pursuing something which has a positive impact over the short term, or that you’re passionate about or interested in this other non-long-term impact of?" I do think there are just continuously going to be some calls to be made or some different decisions.
The field embodies some set of values. I think that people’s empirical views are changing more than the set of implicit values that they have. I think if you just said everyone who really cares about the long term isn’t going into this area, then the overall orientation of the field will persistently be different.
Robert Wiblin: Do you have any views on the particular technical approaches that Pushmeet mentioned in the episode or that the DeepMind folks have written up on their safety blog?
Paul Christiano: The stuff I’m most familiar with that Pushmeet’s group is working on is verification for robustness to perturbations, some work on verification more broadly, and some work on adversarial training and testing. Maybe those are the three things, I don’t know if there’s something else. I’m happy to go through those in order.
Robert Wiblin: Yeah, go through those.
Paul Christiano: So, I guess I’m generally pretty psyched about adversarial testing and training and verification. That is, I think there is this really important problem over both – this is one of those things at the intersection of "it matters over the short term, I think it matters (maybe even more) over the very long term" – of, you have some AI system, you want to delegate a bunch of work to maybe not just one but a whole bunch of AI systems. If they failed catastrophically, it would be really unrecoverably bad.
You can’t really rule out that case with traditional ML training, because you’re just going to try a thing on a bunch of cases that you’ve generated so far, experienced so far. So your training process isn’t at all constraining on this potential catastrophic failure in a new situation that comes up.
So we just want to have something, we want to change the ML training process to have some information about what would constitute a catastrophic failure and then not do that. I think that’s a problem that is in common between the short and long term. I think it matters a lot on the long term. It’s a little bit hard to say whether it’s more on the long term or short term, but I care about it a lot on the long term.
I think that the main approaches we have to that are – the three I really think about are adversarial training and testing, verification, and interpretability or transparency.
I just think people getting familiar with those techniques, becoming good at them, thinking about how you would apply them to richer kinds of specifications, how you grapple with the fundamental limitations in adversarial training where you have to rely on the adversary to think of a kind of case...
The way the technique works in general is that you're like, “I’m concerned about my system failing in the future. I’m going to have an adversary who’s going to generate some possible situations under which the system might fail. And then we’re going to run on those and see if it fails catastrophically.” You have this fundamental limitation where your adversary isn’t going to think of everything.
People are getting experience with, "How do we grapple with that limitation?" In some sense, verification is a response to that limitation. I think it’s productive to have people thinking about both verification and the limits of verification, and testing and the limits of testing. So overall I’m pretty excited about all of that.
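[Editor's illustration, not part of the conversation: the adversarial testing loop Paul describes above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not anyone's actual method. Every name here (toy_system, is_catastrophic, bounded_adversary, adversarial_test) is hypothetical, and the "system" is a stand-in function rather than a trained model. The point is only to show the limitation he mentions: the test finds the failures the adversary thinks of and is silent about the rest.]

```python
from typing import Callable, Iterable, List


def toy_system(x: int) -> int:
    """Hypothetical system under test: behaves well except on two inputs."""
    if x == 7 or x == 1000:
        return -1  # stands in for a catastrophic action
    return x * 2


def is_catastrophic(output: int) -> bool:
    """Hypothetical specification of what counts as a catastrophic failure."""
    return output < 0


def bounded_adversary(limit: int) -> Iterable[int]:
    """Hypothetical adversary: proposes only the cases it can think of (0..limit-1)."""
    return range(limit)


def adversarial_test(
    system: Callable[[int], int],
    cases: Iterable[int],
    failed: Callable[[int], bool],
) -> List[int]:
    """Run the system on each adversary-proposed case and collect the failures."""
    return [x for x in cases if failed(system(x))]


# The adversary searches inputs 0..99, so it finds the failure at 7.
found = adversarial_test(toy_system, bounded_adversary(100), is_catastrophic)
print("failures found by adversary:", found)

# The failure at 1000 is real but lies outside what the adversary proposed.
print("missed failure at 1000:", is_catastrophic(toy_system(1000)) and 1000 not in found)
```

[In this toy, a clean report from the adversary says nothing about input 1000. That gap is what verification aims to close, by reasoning about all inputs in a specified set rather than a sampled subset.]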
Robert Wiblin: Do you share Pushmeet’s general optimism?
Paul Christiano: I don’t know quantitatively exactly how optimistic he is. My guess would be that I’m less optimistic, in the sense that I’m like, “Well, there’s like tens of percent chance that we'll mess this up and lose the majority of the value of the future.” Whereas listening to him, that's not the overall sense I get of where he’s at.
Robert Wiblin: It's not the vibe.
Paul Christiano: Yeah. But it’s a little bit hard to know how to translate between a vibe and an actual level of optimism.
Robert Wiblin: Yeah, it is interesting. Someone could think there's a 20% chance that we’ll totally destroy everything, but still just have kind of a cheerful disposition, so they come across as, "Well, y'know, things could go well as well!"
Among people working on existential risks and global catastrophic risks, and I guess AI in particular, there’s this trade-off between not wanting to do things that other people disagree with or are unenthusiastic about, and at the same time not wanting to have a field that’s so conservative that there are no experiments done unless there is a consensus behind them. Do you think people are too inclined to make unilateralist-curse type mistakes, or not trying things enough?
Paul Christiano: I think my answer to this probably varies depending on the area. For reference, I think the policy you want to be following is: update on the fact that no one else wants to do this thing and then take that really seriously, engage with it a lot before deciding whether you want to do it. And ideally that’s going to involve engaging with the people who have made that decision to understand where they’re coming from.
I think I don’t have a very strong general sense of whether we’re more likely to make one mistake or the other. I think I’d expect the world systematically to make too much of the "a thing can be done unilaterally, so it gets done" one. In the context of this field, I don’t know if there are as many – yeah, I guess I don’t feel super concerned about either failure mode. Maybe I don’t feel that bad about where people are at.