The case for aligning narrowly superhuman models

[-]Rob Bensinger5yΩ19340

I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.

[-]abramdemski5yΩ8260

This isn't an objection to the research direction, just a response to how you're framing it:

If you think GPT-3 is "narrowly superhuman" at medical advice, what topic don't you think it's narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)

A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool to get GPT-3 to give good advice.

(I am not denying that "give good medical advice" is a better initial goal/framing.)

This seems to imply that GPT-3 is broadly superhuman, IE, GPT-3 knows more than the average human about a very broad range of things (although GPT-3 might not know more than the best human in any domain). Going further: the implication is that GPT is a kind of mild superintelligence, currently misaligned in a benign way (it just wants to mimic humans) which hides an unknown portion of its intelligence (making it seem subhuman).

I'm not saying this is exactly true. Maybe GPT-3 really is only narrowly superhuman, in the s...

4magfrump5y

This seems like it's using the wrong ontology to me. Like, in my mind, there are things like medical diagnostics or predictions of pharmaceutical reactions, which are much easier cognitive tasks than general conversation, but which humans are specialized away from. For example, imagine the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals. Then GPT-3 would be in a great position to use people's reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully. It would be strongly superhuman in this important medical task, but nowhere near superhuman in any other conversational task. My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the "right" computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense. Maybe a different reading of your comment is something like, there are so many of these things that if a human had access to superhuman abilities across all these individual narrow domains, that human could use it to create a decisive strategic advantage for themself, which does seem possibly very concerning.

6abramdemski5y

Let's see if I can properly state the nature of the disagreement. I stated that there's a spectrum between "GPT knows more than the average human across a broad variety of domains, but only uses this knowledge to imitate humans, so it's not obvious" and "GPT really knows very little, and its apparent stupidity is stupidity-in-fact". I somewhat operationalized the difference as one of internal representation: to what extent is GPT using a truth+noise model (where it knows a lot of stuff about reality, and then filters it through the biases of particular perspectives) vs a model where everything is thrown together and it's not very possible to extract truth without having more information yourself to know what is truth vs noise. This model has an implication, that Ajeya's project will work to the extent that we're toward the smart-GPT end of the spectrum and won't work to the extent that we're toward the other end. I think you're disagreeing with this implication? So you're saying: even if GPT doesn't internally use anything like a truth+noise model, it's possible to extract a great deal of useful information about the world by observing the statistics of GPT's imitation of internet users. For example, because people talk a lot about diseases online, it should be possible to extract statistics about this from GPT. This can produce a useful diagnostic model, even if GPT isn't internally representing something so useful. Is this roughly what you are saying? If that's what you're saying, then I agree that such a thing could be possible, but I am unsure if this should count as success in Ajeya's terms. If GPT knows a lot of stuff but isn't telling us because it's not trying to be helpful, that's misalignment. Getting it to try to communicate those things to us would be a kind of alignment work. If the statistics of GPT's text model can be used to infer useful things about the world, this doesn't seem related to alignment. But maybe I'm totally mis-identifying t

6magfrump5y

I think this is obscuring (my perception of) the disagreement a little bit. I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple. I then expect GPT-3 to "secretly" have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills. But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple. In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya's approach to be both effective, because "narrowly superhuman" can exist, and reasonably safe, because the gap between "narrowly superhuman" or even "narrowly superhuman in many ways" and "broadly superhuman" is large so GPT-3 being broadly superhuman is unlikely. Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks--becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.

6abramdemski5y

Thanks for trying further to bridge the gap! (It would be nice if you flagged a little better which things you think I think / which things you think I disagree with) OK, that makes sense. So you're not saying that GPT contains useful diagnostic models in the overall statistics of its models of Reddit users (EG that someone complaining of one symptom will often complain of another), nor are you saying that GPT contains a good model of disease which it then feeds through noise (EG it decides that a particular user is a diabetic, which shapes how it plays that character going forward, but the character itself doesn't know it is diabetic, so may say some confused things); indeed, you are denying the latter. But what you are saying is that GPT plays the role of users who do have their own internal models, so it must mimic those models (in cases where that's not too hard to learn). I find this hard to square with your earlier statement: Where it sounds like you think GPT will know something medical science does not know. As for me, I find all of these to be broadly possible. I'd have to think more to give a meaningful plausibility ranking. How many? I am thinking of "medical diagnostics" as just one example of many many areas of expertise which border on GPT's competence. I wasn't thinking there was any special reason to single out medicine in particular as something GPT might have implicit knowledge about. On my model, if GPT contains implicit medical competence, it probably contains similar competence in "every area", although I'm not sure how to quantify. Maybe a similar hidden competence in at least 50% of professions at least as numerous as, say, physicist? (Really, what matters is how much discussion of a profession there is online, not how numerous that profession is, but maybe it's an OK proxy.) My crux would be something special about medical diagnosis such that we especially expect GPT to have implicit talent there. It seems like you think planning cap

4magfrump5y

I'm replying on my phone right now because I can't stop thinking about it. I will try to remember to follow up when I can type more easily. I think the vague shape of what I think I disagree about is how dense GPT-3's sets of implicit knowledge are. I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give. I'm thinking about "intelligent behavior" as something like the set of real numbers, and "human behavior" as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I'm thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can't give you incomputable numbers (unaligned outcomes) and usually won't give you duplicitous behavior (numbers that look very simple at first approximation but actually aren't, like .2500000000000004, which seems to be 1/4 but secretly isn't). I'm not sure where that intuition comes from but I do think I endorse it with moderate confidence. Basically I think for minimal circuit reasons that if "useful narrowly" emerges in GPT-N, then "useful in that same domain but capable of intentionally doing a treacherous turn" emerges later. My intuition is that this won't be until GPT-(N+3) or more, so if you are able to get past unintentional turns like "the next commenter gives bad advice" traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!) In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn. My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives

7abramdemski5y

There was indeed a post posing this question a while back, and discussion in the comments included a counterexample: a construction of a minimal circuit that would be malign.

To my eye, the whole crux of the inner alignment problem is that we have no results saying things like:

* The simplest program which solves a problem is not an inner optimizer
* The minimal circuit which solves a problem is not an inner optimizer
* The fastest program solving a problem is not an inner optimizer

Or any such thing. If we had such a result, then we'd have a grip on the problem. But we don't currently have any result like that, nor any plausible direction for proving such a result. And indeed, thought on the problem suggests that these hypotheses are probably not true; rather, it seems surprisingly plausible, once you think about it, that indeed minimal solutions may sometimes be inner optimizers.

My thinking is that it's probably somewhere between the two. Multiplicative complexity suggests memorizing a lookup table. But there is regularity in the universe. There is transfer learning.

Right. I think transfer learning speaks pretty strongly against this multiplicative model.

2magfrump5y

Looks like the initial question was here and a result around it was posted here. At a glance I don't see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail. Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there's something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I'd expect the model part to be much bigger--like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it's also multiplicative in a technical sense. I'm going to update my probability that "GPT-3 can solve X, Y implies GPT-3 can solve X+Y," and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.
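A rough arithmetic sketch of the scaling point above, using the illustrative numbers from the comment (a 100-dimensional Q-space and a 1-billion-parameter model per task). The function name, and the assumption that the joint Q-space is exactly the product of the per-task sizes while the models simply add, are hypothetical simplifications for illustration only, not a claim about how any real system is built.

```python
# Illustrative sizes taken from the comment above; they are not measurements.
MODEL_PARAMS = 1_000_000_000  # parameters in one task's world model
Q_DIMS = 100                  # dimensions of one task's Q-space


def combined_size(n_tasks: int) -> dict:
    """Size of a system covering n_tasks jointly, under the stated assumptions."""
    model_part = n_tasks * MODEL_PARAMS  # models are assumed to add
    q_part = Q_DIMS ** n_tasks           # Q-space is assumed to be a cross product
    total = model_part + q_part
    return {
        "model": model_part,
        "q": q_part,
        "total": total,
        "q_share": q_part / total,
    }


one = combined_size(1)
two = combined_size(2)
growth = two["total"] / one["total"]

print(f"Q-space for two tasks: {two['q']} dimensions")
print(f"total size grows by a factor of {growth:.6f}")
print(f"share of the total taken by the Q-space: {two['q_share']:.2e}")
```

With these made-up numbers the total roughly doubles when a second task is added, which is the "mostly additive in practice" reading. Under the same assumptions the multiplicative term only overtakes the additive one at around five tasks, so the conclusion depends heavily on the sizes chosen.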

4abramdemski5y

If you mean to suggest this post has a positive result, then I think you're just mis-reading it; the key result says that, under some assumptions, there exists a task for which the minimal circuit will engage in deceptive behavior (IE is a malign inner optimizer). The comment with a counterexample on the original post is here.

2magfrump5y

I see, I definitely didn't read that closely enough.

4Ajeya Cotra5y

Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse. I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out. I'm definitely in the market for a better term here.

6abramdemski5y

One response I generated was, "maybe it's just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice." But I think my real response is: why is the superhuman part important, here? Maybe what's really important is being able to get answers (eg medical advice) without putting them in (eg without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or perhaps more generally, there are other things like this which you expect people to do wrong if they're not dealing with a superhuman case, because you want the technology to eventually work for superhuman cases.

[-]Ajeya Cotra5yΩ7100

In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn't strongly tied to a particular theoretical framing):

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice t

...

4abramdemski5y

I might be on board if "narrowly superhuman" were simply defined differently. Isn't it something more like "the model has information sufficient to do better"? EG, in the GPT example, you can't reliably get good medical advice from it right now, but you strongly suspect it's possible. That's a key feature of the whole idea, right? Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)

1Ajeya Cotra5y

I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

2abramdemski5y

Right, ok, I like that framing better (it obviously fits, but I didn't generate it as a description before).

[-]Ben Pace5y*Ω11250

This was a very solid post and I've curated it. Here are some of the reasons:

I think that the post is a far more careful analysis of questions around what research to do, what research is scalable, and what the potential negative effects are, than almost any other proposal I've seen, whilst also containing clear ideas and practical recommendations. (Many posts that optimize for this level of carefulness end up not saying much at all, or at least little of any practical utility, yet this post says quite a lot of interesting things that are practically useful.) There is a lot of valuable advice, not merely on how to help make narrowly superhuman models useful, but on how to do it in a way that is helpful for alignment. The section "What kind of projects do and don't "count"" is really helpful here.
I appreciate the efforts that Ajeya has made to understand and build consensus around these ideas, talking to people at various orgs (OpenAI, MIRI, more), and this again makes me feel more confident signal-boosting it, given that it contains information about many others' perspectives on the topic. And more broadly, the whole "Objections and responses" section felt like it did a great job at pe

...

[-]johnswentworth5yΩ16240

First and foremost, great post! "How do we get GPT to give the best health advice it can give?" is exactly the sort of thing I think about as a prototypical (outer) alignment problem. I also like the general focus on empirical directions and research-feedback mechanisms, as well as the fact that the approach could produce real economic value.

Now on to the more interesting part: how does this general strategy fail horribly?

If we set aside inner alignment and focus exclusively on outer alignment issues, then in general the failure mode which I think is far and away most likely is roughly "you get what you can measure" or "you get something designed to look good to human supervisors without actually being good". In other words, the inability of humans to reliably/robustly evaluate outcomes is the big problem. (The Fusion Power Generator Scenario is one good example of the type of failure I'm talking about here - the human doesn't understand what-they-want at a detailed enough level to even ask the right questions, let alone actually evaluate a design.)

So: I expect any version of "align narrowly superhuman models" which evaluates the success of the project entirely by human feedback ...

[-]Ajeya Cotra5yΩ12190

Thanks for the comment! Just want to explicitly pull out and endorse this part:

the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process

I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the "sandwich" problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).

I also broadly agree with you that "things looking good to humans without actually being good" is a major problem to watch out for. But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)

6johnswentworth5y

I partially agree with this; alignment is a bottleneck to value for GPT, and actually aligning it would likely produce some very impressive stuff. My disagreement is that it's a lot easier to make something which looks impressive than something which solves a Hard problem (like the sandwich problem), and therefore most impressive-looking "solutions" will probably circumvent the key part of the problem. And if the Hard problem is indeed hard enough to not be solved by anyone, the most impressive-looking results will be those which look good without actually solving it.

6Ajeya Cotra5y

I guess the crux here is "And if the Hard problem is indeed hard enough to not be solved by anyone"; I don't think that's the default/expected outcome. There hasn't been that much effort on this problem in the scheme of things, and I think we don't know where it falls on the range from "pretty easy" to "very hard" right now.

[-]johnswentworth5yΩ11231

Ah... I think we have an enormous amount of evidence on very-similar problems.

For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn't know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.

In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the "sandwich problem" would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don't think we have a good solution in practice; I'd expect the expert business-owner to usually come up with a much better contract.

This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn't understand what the designer wants), versus a p...

5Rohin Shah5y

One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can't write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there's hope in the AI case (e.g. that's a hope behind iterated amplification).

5johnswentworth5y

How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.

5Rohin Shah5y

Yeah, sorry, that's right, I was speaking pretty loosely. You'd still have the same hope -- maybe a team of 2^100 copies of the business owner could draft a contract just as well, or better than, an already expert business-owner. I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

7johnswentworth5y

Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don't see any reason at all to expect it to do anything remotely similar to that.

7Ajeya Cotra5y

The intuition for it is something like this: suppose I'm trying to make a difficult decision, like where to buy a house. There are hundreds of cities I'd be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood. If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of "holistic judgment of neighborhood X", and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
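The clone-army procedure described above can be sketched as a small factored computation. This is only a toy under loud assumptions: the neighborhoods, feature scores, the plain-mean aggregation rule, and all function names are hypothetical, and real "holistic judgment" is unlikely to reduce to averaging numbers.

```python
from typing import Callable, Dict, List

# Hypothetical data: each neighborhood maps a feature name to a score in [0, 1].
Neighborhood = Dict[str, float]


def examine(neighborhood: Neighborhood, feature: str) -> float:
    """One clone's whole job: check a single feature of a single neighborhood."""
    return neighborhood[feature]


def holistic_judgment(neighborhood: Neighborhood) -> float:
    """A mid-level clone aggregates the reports from its feature-level clones."""
    reports = [examine(neighborhood, feature) for feature in neighborhood]
    return sum(reports) / len(reports)  # assumed aggregation rule: plain mean


def bracket(candidates: List[str], score: Callable[[str], float]) -> str:
    """Pairwise elimination bracket over a non-empty list of candidates."""
    while len(candidates) > 1:
        next_round = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            next_round.append(a if score(a) >= score(b) else b)
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])  # the odd one out gets a bye
        candidates = next_round
    return candidates[0]


neighborhoods = {
    "A": {"safety": 0.9, "walkability": 0.4, "price": 0.5},
    "B": {"safety": 0.6, "walkability": 0.8, "price": 0.7},
    "C": {"safety": 0.3, "walkability": 0.9, "price": 0.2},
}

judgments = {name: holistic_judgment(n) for name, n in neighborhoods.items()}
best = bracket(list(neighborhoods), lambda name: judgments[name])
print(best)
```

Because each judgment here is a single number, the bracket is guaranteed to agree with simply taking the maximum. Whether real decisions decompose this cleanly is exactly the open question in the replies that follow.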

7johnswentworth5y

I see, so it's basically assuming that problems factor.

7Ajeya Cotra5y

Yeah, in the context of a larger alignment scheme, it's assuming that in particular the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.

6Raemon5y

I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis)

9paulfchristiano5y

That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well. It's also not really clear what "a human thinking 10,000 years" means, HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)

5adamShimi5y

Well, Paul's original post presents HCH as the specification of a human enlightened judgement. And if we follow the links to Paul's previous post about this concept, he does describe his ideal implementation of considered judgement (what will become HCH) using the intuition of thinking for a decent amount of time. So it looks to me like "HCH captures the judgment of the human after thinking for a long time" is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question that I don't know the answer to. A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision as the human would after thinking for a long time), whereas we need to argue for them in HCH.

5Rohin Shah5y

I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info: ... I don't really know. My guess is that I picked it up from reading giant comment threads between Paul and other people. Tbc it doesn't need to be literally true. The argument needed for safety is something like "a large team of copies of non-expert agents could together be as capable as an expert". I see the argument "it's probably possible for a team of agents to mimic one agent thinking for a long time" as mostly an intuition pump for why that might be true.

5johnswentworth5y

"As capable as an expert" makes more sense. Part of what's confusing about "equivalent to a human thinking for a long time" is that it's picking out one very particular way of achieving high capability, but really it's trying to point to a more-general notion of "HCH can solve lots of problems well". Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.

6Rohin Shah5y

Yes, I explicitly agree with this, which is why the first thing in my previous response was

3Ajeya Cotra5y

My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.

[-]johnswentworth5yΩ5110

HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.

(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
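A toy sketch of the "bureaucracy" picture above, assuming a task that happens to factor perfectly (summing a list). The `BUDGET` cap, the `Tally` bookkeeping, and the step-counting convention are all hypothetical choices made for illustration; this is not an implementation of HCH itself, only of the point that every node does a small amount of work while the tree as a whole does a lot.

```python
from dataclasses import dataclass

BUDGET = 2  # hypothetical cap on how many items a single node may handle directly


@dataclass
class Tally:
    total_steps: int = 0         # thinking summed over the whole tree
    max_steps_per_node: int = 0  # the most thinking any single node did
    nodes: int = 0               # how many people the bureaucracy used


def hch_sum(items: list, tally: Tally) -> int:
    """Answer directly if the question is small; otherwise ask two underlings."""
    tally.nodes += 1
    if len(items) <= BUDGET:
        steps = len(items)  # look at each item once
        answer = sum(items)
    else:
        mid = len(items) // 2
        left = hch_sum(items[:mid], tally)   # first underling
        right = hch_sum(items[mid:], tally)  # second underling
        steps = 1                            # only combine the two reports
        answer = left + right
    tally.total_steps += steps
    tally.max_steps_per_node = max(tally.max_steps_per_node, steps)
    return answer


tally = Tally()
result = hch_sum(list(range(1, 1025)), tally)
print(result, tally)
```

No node ever exceeds the budget, yet the total work across the tree is hundreds of times larger than any one node's share. The sketch works only because summation splits into independent pieces, which is the factored-cognition assumption the comment points at.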

1Ajeya Cotra5y

Yes sorry — I'm aware that in the HCH procedure no one human thinks for a long time. I'm generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could "effectively replicate the benefits you could get from having a human thinking a long time," in terms of the role that it plays in an overall scheme for alignment. This isn't guaranteed to work out, of course. My position is similar to Rohin's above:

7Razied5y

There are plenty of problems where evaluating a solution is way way easier than finding the solution. I'm doubtful that the model could somehow produce a "looks good to a human but doesn't work" solution to "what is a room-temperature superconductor?". I agree that for biological problems the issue is much more concerning, and certainly for any kind of societal problem, but as long as we stay close to math, physics and chemistry, "looks good to a human" and "works" are pretty closely related to each other.

2Charlie Steiner5y

Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure." E.g.: "If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as what peak performance looks like? So anyhow I'm pessimistic about sandwiching for moral questions." I'm curious if the upvote disparity means I'm the minority position here :P

[-]johnswentworth5yΩ11250

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI is hung up on these sorts of questions, then we've already mostly-won. That's already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It's already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It's already an AI which is unlikely to flood a room in order to fill a cauldron with water.

The vast majority of human values are not things we typically think of as "moral questions"; they're things which are so obvious that we usually don't even think of them until they're pointed out.... (read more)

2Charlie Steiner5y

I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree? Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said "you get what you measure" is a problem because humans can disagree when their values are 'measured' without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).

2johnswentworth5y

Uh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terri Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able". Or are you using "moral discourse" in a broader sense? I disagree with the exact phrasing "fact of the matter for whether decisions are good or bad"; I'm not supposing there is any "fact of the matter". It's hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want. Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.

2TurnTrout5y

English sentences don't have to hold up to optimization pressure, our AI designs do. If I say "I'm hungry for pizza after I work out", you could say "that doesn't hold up to optimization pressure - I can imagine universes where you're not hungry for pizza", it's like... okay, but that misses the point? There's an implicit notion here of "if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won." Perhaps this notion isn't obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer. Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say "this seems true in the main, although I can imagine situations where it's not." Maybe this is what you meant, in which case I agree.

[-]TurnTrout5yΩ5130

Impression before reading LW post comments & MIRI comments: this strikes me as a valuable "fourth area" of core research that we could start growing now. I'm uncertain about the technical fruits of the research itself (I expect it to be somewhere between 'slightly positive' and 'moderate-high positive'), but it seems like we could indeed scale such research into its own healthy (& prestigious!) subfield in ML. This could diversify the alignment research portfolio in a way that scales sublinearly with long-termist research input: in the long run, we wouldn't need everyone involved to be 'core' alignment researchers.

I have a few notes of unease that I haven't yet sat down to figure out, so I may reply to this comment with more thoughts.

[-]John Schulman4yΩ10110

Super clear and actionable -- my new favorite post on AF.

I also agree with it, and it's similar to what we're doing at OpenAI (largely thanks to Paul's influence).

[-]Rohin Shah5yΩ7100

Planned summary for the Alignment Newsletter:

One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical you may want to read the full post as your question might be answered there.
The author specifically suggests that we work on **aligning narrowly superhuman models** to make them more useful. _Aligning_ a model roughly means harnessing the full capabilities of the model and orienting these full capabilities towards helping humans. For example, GPT-3 presumably “knows” a lot about medicine and health. How can we get GPT-3 to apply this knowledge as best as possible to be maximally useful in answering user questions about health?
_Narrowly superhuman_ means that the model has more knowledge or “latent capability” than either its overseers or its users. In the exam

... (read more)

[-]jungofthewon5yΩ590

This is exactly what Ought is doing as we build Elicit into a research assistant using language models / GPT-3. We're studying researchers' workflows and identifying ways to productize or automate parts of them. In that process, we have to figure out how to turn GPT-3, a generalist by default, into a specialist that is a useful thought partner for domains like AI policy. We have to learn how to take feedback from the researcher and convert it into better results within session, per person, per research task, across the entire product. Another spin on it: w... (read more)

[-]CronoDAS5y90

Someone on Reddit managed to successfully get GPT-3 to guess the solution to his mystery story, which none of the human readers had figured out yet.

[-]David_Kristoffersson5y80

The amount of effort going into AI as a whole ($10s of billions per year) is currently ~2 orders of magnitude larger than the amount of effort going into the kind of empirical alignment I’m proposing here, and at least in the short-term (given excitement about scaling), I expect it to grow faster than investment into the alignment work.

There's a reasonable argument (shoutout to Justin Shovelain) that the risk is that work such as this done by AI alignment people will be closer to AGI than the work done by standard commercial or academic research, and th... (read more)

4Ajeya Cotra5y

I'm personally skeptical that this work is better-optimized for improving AI capabilities than other work being done in industry. In general, I'm skeptical of perspectives that work that the rationalist/EA/alignment crowd does Pareto-dominates the other work going on -- that is, that it's significantly better for both alignment and capabilities than standard work, such that others are simply making a mistake by not working on it regardless of what their goals are or how much they care about alignment. I think sometimes this could be the case, but I wouldn't bet on it being a large effect. In general, I expect work optimized to help with alignment to be worse on average at pushing forward capabilities, and vice versa.

[-]David Scott Krueger (formerly: capybaralet)5yΩ470

I haven't read this in detail (hope to in the future); I only skimmed based on section headers.
I think the stuff about "what kinds of projects count" and "advantages over other genres" seems to miss an important alternative, which is to build and study toy models of the phenomena we care about. This is a bit like the gridworlds stuff, but I thought the description of that work missed its potential, and didn't provide much of an argument for why working at scale would be more valuable.

This approach (building and studying toy models) is popular in ML re... (read more)

4Ajeya Cotra5y

The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave. I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.

1David Scott Krueger (formerly: capybaralet)5y

Thanks for the response! I see the approaches as more complementary. Again, I think this is in keeping with standard/good ML practice. A prototypical ML paper might first describe a motivating intuition, then formalize it via a formal model and demonstrate the intuition in that model (empirically or theoretically), then finally show the effect on real data. The problem with only doing the real data (i.e. at scale) experiments is that it can be hard to isolate the phenomena you wish to study. And so a positive result does less to confirm the motivating intuition, as there are many other factors at play that might be responsible. We've seen this happen rather a lot in Deep Learning and Deep RL, in part because of the focus on empirical performance over a more scientific approach.

[-]adamShimi5y*Ω250

Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researchers' opinions of this proposal.

Personally, I'm enthusiastic about it with caveats expanded below. If I try to interpret your proposal according to the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build and what do we want). My caveats can be ... (read more)

[-]Charlie Steiner3yΩ240Review for 2021 Review

This was an important and worthy post.

I'm more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don't really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems - sandwiching requires advances in how we interpret human feedback, and how we tr... (read more)

[-]Richard_Ngo5yΩ240

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objecti... (read more)

[-]Ajeya Cotra5yΩ6120

We're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).

already seen as a standard way to make progress on the full alignment problem

It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.)

Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.

7Rohin Shah5y

It's important to distinguish between:

* "We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us"
* "We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work"

My guess is that Ajeya means the former but you're interpreting it as the latter, though I could easily be wrong about either of those claims.

[-]Quintin Pope5y30

Suppose we want to train GPT-n to pursue any of many different goals (give good medical advice, correctly critique an argument, write formal and polite text, etc). We could find training data that demonstrate a possible goal and insert natural language control codes around that data.

E.g., suppose XY is a section of training text. X contains a description of a medical problem. Y gives good medical advice. We would then modify XY to be something like:

[give correct medical advice]X[start]Y[end]

We would then repeat this for as many different goals and for as mu... (read more)
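As a rough illustration of the wrapping step described above, here is a minimal Python sketch. The function names (`add_control_codes`, `build_prompt`) are hypothetical and chosen only for this example; the bracket format simply mirrors the `[goal]X[start]Y[end]` pattern in the comment, and nothing here is claimed to be how any real model was trained.

```python
# Hypothetical delimiters, copied from the pattern in the comment above.
START = "[start]"
END = "[end]"


def add_control_codes(goal: str, x: str, y: str) -> str:
    """Wrap one training example XY with a natural-language control code.

    goal: description of the goal the text demonstrates
    x:    the context (e.g. a description of a medical problem)
    y:    the demonstration (e.g. good medical advice)
    """
    return f"[{goal}]{x}{START}{y}{END}"


def build_prompt(goal: str, x: str) -> str:
    """At inference time, supply the control code and context, then let
    the model continue after the start marker."""
    return f"[{goal}]{x}{START}"


if __name__ == "__main__":
    print(add_control_codes("give correct medical advice", "X", "Y"))
    print(build_prompt("give correct medical advice", "X"))
```

The same goal string is used at training and inference time, so the model would only ever see the control code paired with text that demonstrates that goal.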

2Vaniver5y

This feels way less secure to me than 'control codes' that use the model internals, since presumably users could submit text with control codes in a way that then causes problems.

1Quintin Pope5y

The control codes could include a special token/sequence that only authorized users can use. Also, if you’re allowing arbitrary untrusted queries to the model, your security shouldn’t depend on model output anyways. Even if attackers can’t use control codes, they can still likely get the model to do what they want via blackbox adversarial search over the input tokens.
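One way to picture the "special token/sequence that only authorized users can use" is the sketch below. It is an assumption-laden illustration: `AUTH_TOKEN`, `scrub`, and `build_query` are made-up names, and a real system would more likely reserve a token ID at the tokenizer level rather than filter strings. As the comment itself notes, this would not stop black-box adversarial search over ordinary input tokens.

```python
# Hypothetical reserved sequence; in practice this might be a dedicated
# token that cannot be produced by tokenizing ordinary text.
AUTH_TOKEN = "<|authorized|>"


def scrub(user_text: str) -> str:
    """Remove the reserved sequence from untrusted text.

    Loops until no copy remains, so that deleting one occurrence cannot
    splice a new one together from the surrounding characters.
    """
    while AUTH_TOKEN in user_text:
        user_text = user_text.replace(AUTH_TOKEN, "")
    return user_text


def build_query(goal: str, user_text: str, authorized: bool) -> str:
    """Only authorized callers get a control code prepended; the text is
    scrubbed of the reserved sequence in both cases."""
    body = scrub(user_text)
    if authorized:
        return f"{AUTH_TOKEN}[{goal}]{body}[start]"
    return body
```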

[-]Charlie Steiner5yΩ230

I (conceptual person) broadly do agree that this is valuable.

It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly ... (read more)

1Ajeya Cotra5y

I don't think you can get away with supervised learning if you're holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that's how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans' evaluations, meaning that to "get models to do the best they can to help us" we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc. I don't agree that there's obviously conceptual progress that's necessary for moral advice which is not necessary for medical advice; I'd expect a whole class of tasks to require similar types of techniques, and if there's a dividing line I don't think it is going to be "whether it's related to morality", but "whether it's difficult for the humans doing the evaluation to tell what's going on." To answer your question for both medical and moral advice, I'd say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the "sandwiching" idea). I'm not sure if it'll work out though.

2Charlie Steiner5y

Re: part 1 - Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning - the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for "goodness" that can generalize well even if you condition on "goodness" being slightly outside any of the training examples. Re: part 2 - What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this to align AGI). I agree that there's no magic property a moral question has that no medical question could have. But there are non-magical properties I expect to be relevant. With medical advice from a text model, I'm not expecting it to learn a detailed model of the human body and be able to infer new medical conditions and treatments that human experts haven't figured out yet. I'm just expecting it to do verbal reasoning to arrive at the same substantive advice a human expert would give, maybe packaged in a slightly superhuman good explanation. With moral advice, though, ask 3 human experts and you'll get 4 opinions. This is made worse by the fact that I've sneakily increased the size of the problem - "moral advice" can be about almost anything. Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines? Medical advice seems to be in the "supervisable regime," where it's fulfilled its promise by merely telling us things that human experts know. Moral advice is very not, because humans aren't consistent about morality in the same way they can be about medicine. If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think abou

[-]William_S5yΩ220

One easy way to get people who can't solve the task, for the purposes of sandwiching, is to take people who could solve the task and then give them insufficient time to solve it, or leave them uninformed of some relevant facts about the specific task they are trying to solve.

If you can't get to sandwiching directly, a simpler way to measure whether you are making progress towards it is to look at whether people provide better supervision with your tool than without it, that is, whether they accomplish more on the task.

Both of these approaches feel like they aren... (read more)
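The with-tool versus without-tool comparison described above could be tallied along the following lines. This is only a sketch under stated assumptions: the function name `supervision_uplift` and the idea of a single numeric task score per participant are made up for illustration, and a real study would also need a significance test and matched participants.

```python
from statistics import mean
from typing import Sequence


def supervision_uplift(
    scores_without_tool: Sequence[float],
    scores_with_tool: Sequence[float],
) -> float:
    """Difference in mean task score between supervisors who used the
    tool and supervisors who did not. Positive means the tool helped."""
    return mean(scores_with_tool) - mean(scores_without_tool)


if __name__ == "__main__":
    # Hypothetical scores on some task-quality scale from 0 to 1.
    without_tool = [0.5, 0.5]
    with_tool = [0.75, 0.75]
    print(supervision_uplift(without_tool, with_tool))
```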

[-]magfrump5y20

This post matches and specifies some intuitions I've had for a while about empirical research and I'm very happy it has been expanded.

[-]SoerenMind4y10

Google seems to have solved some problem like the above for a multi-language-model (MUM):

"Say there’s really helpful information about Mt. Fuji written in Japanese; today, you probably won’t find it if you don’t search in Japanese. But MUM could transfer knowledge from sources across languages, and use those insights to find the most relevant results in your preferred language."

[-]SoerenMind5y10

How useful would it be to work on a problem where what the LM "knows" cannot be superhuman, but it still knows how to do well and needs to be incentivized to do so? A currently prominent example problem is that LMs produce "toxic" content:
https://lilianweng.github.io/lil-log/2021/03/21/reducing-toxicity-in-language-models.html

[-]William_S5yΩ110

Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.

[-]Charlie Sanders5y10

On fuzzy tasks: I think the appropriate frame of comparison is neither an average subset (Mechanical Turk) nor the ideal human (Go), but instead the median resource that someone would be reasonably likely to seek out. To use healthcare as an example, you'd want your AI to beat the average family doctor that most people would reach out to, as opposed to either a layman's opinion or the preeminent doctor in the field.

4Charlie Steiner5y

Hello fellow Charlie! For half a second I thought I'd written a comment in a fugue state and forgotten it :P

2Raemon5y

I think that makes sense for "building a useful product", but less so for "test the hypothesis that you can get aligned superhuman performance out of an unaligned-by-default intelligence, for purposes of later being more informed when you go to build an aligned, godlike intelligence."

1Charlie Sanders5y

Right, but I'm not sure how you'd "test" for success in that scenario. Usefulness to humanity, as demonstrated by effective product use, seems to me like the only way to get a rigorous result. If you can't measure the success or failure of an idea objectively, then the idea probably isn't going to matter much.

At least better than some salient large group of humans in a particular context, like “Mechanical Turk workers”, “stackoverflow users”, etc. Right now, models are only superhuman with respect to all humans in particular crisp domains like games. E.g. AlphaGoZero is better at Go than any human; GPT-3 probably has the potential to give better advice than some humans. ↩︎
This idea isn’t original to me -- a number of others (especially some people working on long-term AI alignment at OpenAI and DeepMind) have thought along similar lines. My own thinking about this has been informed a lot by discussions with Paul Christiano and Holden Karnofsky. ↩︎
e.g., Mechanical Turk workers who are hired to give feedback to the model ↩︎
Though if we could pull off a path where we build an AI system that is superhuman in certain engineering capabilities but not yet human-level in modeling and manipulating people, and use that system to cut down on x-risk from other AI projects without having to figure out how to supervise arbitrary superhuman models, that could be really good. ↩︎
Note that I don’t think this is the only way to study interpretability and robustness, or even necessarily the best way. In this project-generation formula, the domain and task were optimized to make reward learning an especially interesting and important challenge, rather than to make interpretability or robustness especially challenging, interesting, or important. I think it’s good to be complete and to try to ensure interpretability and robustness in these domains, but we should probably also do other lines of research which choose domains / tasks that are specifically optimized for interpretability or robustness, rather than reward learning, to be especially challenging and important. ↩︎
Pragmatically speaking, fine-tuning a large model rather than training from scratch is also orders of magnitude cheaper, and so a lot more accessible to most researchers. ↩︎
Another way of seeing why it wouldn’t count is that “predict the next token” is an extremely non-fuzzy training signal. ↩︎
Human contractors make these labels, but they are not providing feedback. ↩︎
More speculatively, if we’re realizing models’ full potential as we go along, there’s less chance of ending up with what I’ll call an “unforced sudden takeoff”: a situation where on some important set of fuzzy tasks models jump suddenly from being not-that-useful to extraordinarily useful, but this was due to not bothering to figure out how to make models useful for fuzzy tasks rather than any inherent underlying fact about models. I’m not sure how plausible an unforced sudden takeoff is though, and I’m inclined (because of efficient market intuitions) to think the strong version of it is not that likely. H/t Owen Cotton-Barratt for this thought. ↩︎
E.g., that whenever there are two or more generalizations equally consistent with the training data so far, models will never generalize in the way that seems more natural or right to humans. ↩︎
I think eventually gridworlds and games will probably fade away as it becomes more practical to work with larger models instead, and dynamics like the treacherous turn start to show up in messier real-world settings. ↩︎
One idea a couple of others have suggested here and which I’m generally interested in is “transparency in (narrowly superhuman) language models”: finding ways to understand “what models are thinking and why,” especially when they know more about something than humans do. I like this idea but am very unsure about what execution could look like. E.g., would it look like Chris Olah’s work, which essentially “does neuroscience” on neural networks? Would it look like training models to answer our questions about what they’re thinking? Something else? ↩︎
Though you could think that in an absolute sense it and all the other approaches that aren’t tackling treachery head-on are doomed. ↩︎
I would also prefer other things being equal that EAs focused on long-run x-risk get the recognition for this work rather than others, but as I said above I consider this secondary and think that this agenda is good on the merits, not just as career capital for EAs. ↩︎
There are some innovators for whom the value of being in an area is strictly decreasing in its crowdedness, because their main value-add is to “start something from nothing.” But I don’t think that applies to most contributors, even those who have an extremely large impact eventually (which might even be larger than the innovators’ impact in some cases). ↩︎
Some people have argued that the “verifying long-run solutions” path is dominant because the other stuff is likely to happen anyway, but I’m not convinced. I think all three paths to impact that I laid out are likely to happen one way or another, and there’s room to speed up or improve all of them. I do think there could be some boost to the “verifying long-run solutions” path, but all in all I feel like it’ll be ⅓ to ¾ of the value, not >90% of the value. ↩︎
The most plausible competing pitch in my mind is “get language models to answer questions honestly”, which seems like it could get at the “ascription universality” / “knowing everything the model knows” concept (h/t Evan H, Owen C-B, Owain E). That would narrow the focus to language models and question-answering, and rule out projects like “get non-coders to train a coding model.” I think the “get language models to answer questions honestly” frame is reasonable and I want to see work done under that banner too, but I’m not convinced it’s superior. It considerably narrows the scope of what’s “in”, cutting down on long-run field growth potential, and I think a lot of the projects that are “out” (like the coding project) could be helpful and informative. I also worry that the tagline of “honesty” will encourage people to focus on “avoiding harmful lies that are nonetheless pretty easy for humans to detect”, rather than focusing on regimes where models exceed human performance (see this objection for more discussion of that). ↩︎
It’s possible other places, like Google Brain or some other FAANG lab, would also have roles available doing this type of work -- I am just more unsure because there is less of a long-termist alignment researcher presence in those places. ↩︎
Eventually, when models are more strongly superhuman, I think it will get too hard to even tell whether outcomes were acceptable, because AI systems could e.g. compromise the cameras and sensors we use to measure outcomes. So relying on outcomes earlier on feels like “kicking the can down the road” rather than “practicing what we eventually want to be good at.” “Don’t kick the can down the road, instead practice what we eventually want to be good at” is the overall ethos/attitude I’m going for with this proposal. ↩︎

