Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The predominant view on LW seems to be "pure AI capabilities research is bad, because capabilities progress alone doesn't contribute to alignment progress, and capabilities progress without alignment progress means that we're doomed".

I understand the arguments for this position, but I have what might be called the opposite position. The opposite position seems at least as intuitive as the standard position to me, and it confuses me that it's not discussed more. (I'm not confused that people reject it; I'm confused that nobody seems to even bring it up for the purpose of rejecting it.)

The opposite position is "In order to do alignment research, we need to understand how AGI works; and we currently don't understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out. Doing capabilities research now is good because it's likely to be slower now than it might be in some future where we had even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until a later time, then we might get a sudden spurt of it and wouldn't have the time to turn our increased capabilities understanding into alignment progress. Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research."

Some reasons I have for holding this position:

1) I used to do AI strategy research. Among other things, I looked into how feasible it is for intelligence to rapidly turn superintelligent, and what kinds of pathways there are into AI disaster. But a thought that I kept having when doing any such research was "I don't know if any of this theory is of any use, because so much depends on what the world will be like when actual AGI is developed, and what that AGI will look like in the first place. Without knowing what AGI will look like, I don't know whether any of the assumptions I'm making about it are going to hold. If any one of them fails to hold, the whole paper might turn out to be meaningless."

Eventually, I concluded that I can't figure out a way to make the outputs of strategy research useful for as long as I know as little about AGI as I do. Then I went to do something else with my life, since it seemed too early to do useful AGI strategy research (as far as I could tell).

2) Compare the state of AI now to how it was before the deep learning revolution happened. It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were before the DL revolution. For instance, Redwood Research is doing research on language models because they believe that their research is analogous to some long-term problems.

Assume that Redwood Research's work will actually turn out to be useful for aligning superintelligent AI. Language models are one of the results of the DL revolution, so their work couldn't have been done before that revolution. It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse since that world's equivalent of Redwood Research would have less time to do their research.

3) As a similar consideration, language models are already "deceptive" in a sense - asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I linked people to some of that nonsense, multiple people pointed out that InstructGPT's answers sound like those of a student who's taking an exam and is asked to write an essay about a topic they know nothing about, but tries to fake it anyway (that is, trying to deceive the examiner).

Thus, even if you are doing pure capabilities research and just want your AI system to deliver people accurate answers, it is already the case that you can see a system like InstructGPT "trying to deceive" people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so you have the incentive to work on identifying and stopping such "deceptive" computations as a capabilities researcher already.

So it has already happened that

  • Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, that can be used to develop our understanding of it and develop new ideas for dealing with it.
  • Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between "capabilities research" and "alignment research".
  • Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well.

It's not that I'd put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. InstructGPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights to various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.

4) Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.
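The "looks good" vs. "is correct" gap can be illustrated with a toy best-of-n selection. This is only a minimal sketch under assumptions that are not from the argument above: each candidate answer has a true correctness score drawn from a standard normal distribution, how good it looks is that score plus independent standard normal noise, and "optimization pressure" is modeled as picking the best-looking of n candidates. The function names `best_of_n` and `appearance_gap` are hypothetical.

```python
import random
import statistics


def best_of_n(n, trials=2000, seed=0):
    """Average 'looks good' and 'is correct' scores of the answer picked by best-of-n.

    Toy assumption (not from the post): correctness ~ N(0, 1), and the
    apparent quality seen by the rater is correctness + N(0, 1) noise.
    """
    rng = random.Random(seed)
    picked_looks, picked_correct = [], []
    for _ in range(trials):
        best = None  # (looks_good, correct) of the best-looking candidate so far
        for _ in range(n):
            correct = rng.gauss(0.0, 1.0)
            looks_good = correct + rng.gauss(0.0, 1.0)
            # Selection only ever sees the proxy, never the true score.
            if best is None or looks_good > best[0]:
                best = (looks_good, correct)
        picked_looks.append(best[0])
        picked_correct.append(best[1])
    return statistics.mean(picked_looks), statistics.mean(picked_correct)


def appearance_gap(n, trials=2000, seed=0):
    """How much better the selected answer looks than it actually is, on average."""
    looks, correct = best_of_n(n, trials, seed)
    return looks - correct


if __name__ == "__main__":
    for n in (1, 10, 100):
        looks, correct = best_of_n(n)
        print(f"best of {n:>3}: looks {looks:.2f}, correct {correct:.2f}, "
              f"gap {looks - correct:.2f}")
```

In this toy model the selected answers still become more correct as n grows; what widens is the gap between how good they look and how correct they are. The stronger claim, that correctness itself eventually suffers, would need extra assumptions (for example, ways of looking good that actively trade off against being correct) which this sketch does not model.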

As such, this seems like an unsolvable problem... but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.

I'm not sure how this desire works, but I don't think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined. If I want to help you fulfill your preferences, then I have a sense of what it would mean for your preferences to be fulfilled, and I can have a goal of optimizing for that (even while I am uncertain of what exactly your preferences are).

We don't currently seem to know how to do this kind of a theory of mind, but it can't be that much more complicated than other human-level capabilities are, since even many non-human animals seem to have some version of it. Still, I don't think we can yet implement that kind of a theory of mind in any AI system. So we have to wait for our capabilities to progress to the kind of a point where this kind of a capacity becomes possible, and then we can hopefully use that capabilities understanding to solve what looks like a crucial piece of alignment understanding.

28 comments

In order to do alignment research, we need to understand how AGI works; and we currently don't understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out.

I totally agree with this. Alas, "understand how AGI works" is not something which most capabilities work even attempts to do.

It turns out that people can advance capabilities without having much clue what's going on inside their magic black boxes, and that's what most capabilities work looks like at this point.

Agreed, but the black-box experimentation seems like it's plausibly a prerequisite for actual understanding? E.g. you couldn't analyze InceptionV1 or CLIP to understand its inner workings before you actually had those models. To use your car engine metaphor from the other comment, we can't open the engine and stick it full of sensors before we actually have the engine. And now that we do have engines, people are starting to stick them full of sensors, even if most of the work is still focused on building even fancier engines.

It seems reasonable to expect that as long as there is low-hanging fruit to be picked using black boxes, we get a lot of black boxes and the occasional paper dedicated to understanding what's going on with them and how they work. Then when it starts getting harder to get novel interesting results with just black box tinkering, the focus will shift to greater theoretical understanding and more thoroughly understanding everything that we've accomplished so far.

I think we are getting some information. For example, we can see that token level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation -- how much can you get with just world modeling, versus world modeling plus reward/action combined?

If the transformer architecture is enough to get us there, it tells us a sort of null hypothesis for intelligence: that the structure for predicting sequences by comparing all pairs of elements of a limited sequence is general.

Not rhetorically, what kind of questions do you think would better lead to understanding how AGI works?

I think teaching a transformer with an internal thought process (predicting the next tokens over a part of the sequence that's "showing your work") would be an interesting insight into how intelligence might work. I thought of this a little while back, but then discovered that it is also a long-standing MIRI research direction into transparency. I wouldn't be surprised if Google took it up at this point.

Not rhetorically, what kind of questions do you think would better lead to understanding how AGI works?

Suppose I'm designing an engine. I try out a new design, and it surprises me - it works much worse or much better than expected. That's a few bits of information. That's basically the sort of information we get from AI experiments today.

What we'd really like is to open up that surprising engine, stick thermometers all over the place, stick pressure sensors all over the place, measure friction between the parts, measure vibration, measure fluid flow and concentrations and mixing, measure heat conduction, etc, etc. We want to be able to open that black box, see what's going on, figure out where that surprising performance is coming from. That would give us far more information, and far more useful information, than just "huh, that worked surprisingly well/poorly". And in particular, there's no way in hell we're going to understand how an engine works without opening it up like that.

The same idea carries over to AI: there's no way in hell we're going to understand how intelligence works without opening the black box. If we can open it up, see what's going on, figure out where surprises come from and why, then we get orders of magnitude more information and more useful information. (Of course, this also means that we need to figure out what things to look at inside the black box and how - the analogues of temperatures, pressures, friction, mixing, etc in an engine.)

You can build a good engine without any sensors inside, and indeed people did -- i.e. back in the 19th century when sensors of that sort didn't exist yet. (They had thermometers and pressure gauges, but they couldn't just get any information from any point inside the engine block, like we can by looking at activations in a NN.) What the engineers of the 19th century had, and what we need, is a general theory. For engines, that was thermodynamics. For AI, we need some kind of Theory of Intelligence. The scaling laws might be pointing the way to a kind of thermodynamics of intelligence.

It seems to me that this argument only makes sense if we assume that “more capabilities research now” translates into “more gradual development of AGI”. That’s the real crux for me.

If that assumption is false, then accelerating capabilities is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed. I do agree that there is important alignment-related work that we can only do in the future, when AGI is closer. I don't agree that there is nothing useful being done right now.

On the other hand, if that assumption is true (i.e. the assumption “more capabilities research now” translates into “more gradual development of AGI”), then there's at least a chance that more capabilities research now would be net positive.

However, I don't think the assumption is true—or at least, not to any appreciable extent. It would only be true if you thought that there was a different bottleneck to AGI besides capabilities research. You mention faster hardware, but my best guess is that we already have a massive hardware overhang—once we figure out AGI-capable algorithms, I believe we already have the hardware that would support superhuman-level AGI with quite modest amounts of money and chips. (Not everyone agrees with me.) You mention “neuroscience understanding”, but I would say that insofar as neuroscience understanding helps people invent AGI-capable learning algorithms, neuroscience understanding = capabilities research! (I actually think some types of neuroscience are mainly helpful for capabilities and other types are mainly helpful for safety, see here.) I imagine there being small bottlenecks that would add a few months today, but would only add a few weeks in a decade, e.g. future better CUDA compilers. But I don't see any big bottlenecks, things that add years or decades, other than AGI capabilities research itself.

Even if the assumption is significantly true, I still would be surprised if more capabilities research now would be a good trade, because (1) I do think there’s a lot of very useful alignment work we can do right now (not to mention outreach, developing pedagogy, etc.), (2) the most valuable alignment work is work that informs differential technological development, i.e. work that tells us exactly what AGI capabilities work should be done at all, namely R&D that moves us down a path to maximally alignable AGI, but that's only valuable to the extent that we figure things out before the wrong kind of capabilities research has already been completed. See Section 1.7 here.

I'm not sure how this desire works, but I don't think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined. 

I do think that would be valuable to know, and am very interested in that question myself, but I think that figuring it out is mostly a different type of research than AGI capabilities research—loosely speaking, what you're talking about looks like “designing the right RL reward function”, whereas capabilities research mostly looks like “designing a good RL algorithm”—or so I claim, for reasons here and here.

In some sense I would think it's almost tautologically true that faster capabilities research shortens the timeline in which alignment and strategy researchers do their own work. 

It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were before the DL revolution.

Not at all obvious. I think we barely get insight, at least so far, from DL.

More broadly, capabilities research can be strategically-relevantly different from other capabilities research.

E.g., capability research that is published or likely will be published, adds to the pile of stuff that arbitrary people can use to make AGI. Capability research that will be kept private has much less of this problem.

Capability research can be more or less "about" understanding AGI in a way that leads to understanding how to align it, vs understanding AGI in a way that leads to being able to make it (whether FAI or UFAI). For example, one could pour a bunch of research into building a giant evolution simulator with rich environment and heuristics for skipping ahead, etc. This is capabilities research that seems to me not super likely to go anywhere, but if it does go anywhere, it seems more likely to lead to AGI that's opaque and unalignable by strong default, and even if transparency-type stuff can be bolted on, the evolution-engineering itself doesn't help very much with doing that.

TurnTrout, 1y

Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.

As such, this seems like an unsolvable problem... but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.

This argument proves too much. Humans are rewarded by each other for appearing to be friendly, but not actually for being friendly. Therefore, they are incentivized to seem friendly. Therefore, no humans will really be friendly or really care about each other, because that's not what human culture is really selecting for. 

We have concluded a falsehood; the line of reasoning requires, at the least, more assumptions which make it clear why the argument obtains in the AI case but not the human case. (Personally, I suspect the reasoning is mostly invalid.) So, I think "selection" / "incentivization" arguments require great care and caution.

(As another line of reasoning, "Reward is not the optimization target" contravenes claims like "GPT will always be 'deceptive' due to Goodhart's / Siren worlds.")

I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety setting up powerful idea machines that get more powerful when they are given more time to gear up.  A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I gained a sense that AI Safety movement building is only now becoming its own thing; and of course it'll take time to iterate to get it right, then time for people to pass through the programs, then time for them to have a career.

I suspect the kind of argument that we need more capabilities to make progress might have been stronger earlier in the game, but now that we already have powerful language models, there's a lot that we can do without needing AI to advance any further.

lc, 2y

The predominant view on LW seems to be "pure AI capabilities research is bad, because capabilities progress alone doesn't contribute to alignment progress, and capabilities progress without alignment progress means that we're doomed".

I feel like I need to point out that LessWrong seemed mostly neutral to me on the subject of capabilities research until my post asking why people weren't more negative. I mentioned there that I had literally never heard someone here suggest the "get people to stop contributing to AI development" strategy informally and I didn't get any counterexamples. As far as I can tell, people just didn't talk a whole lot about the question, to the point that it was easy for people like Elon Musk to lose the plot on the whole "AI safety" thing and go start essentially a capabilities company in response to the problem. People are also just talking more about capabilities now in general, because of Eliezer's doomerism and recent gains giving credence to some really short (10-15 year) timelines.

The pessimism post got a lot of upvotes, but it's only been two months since it was made, so there hasn't been a lot of time for people with alternative opinions to present them. In addition, as I tried to make clear with a lot of conditional language, that post only makes sense as an argument for slowing down research if you believe AI is probably going to kill everyone. I made it largely as a response to Eliezer's recent doomerism, and as a pseudo-criticism of how his method of dealing with the problem in practice (namely, doing lots of math and ML research in relative quiet) doesn't seem to line up with his words. I share a lot of his beliefs, but if you don't agree with them, or think AI will probably be the best thing ever, then the post's conclusions about capabilities research don't necessarily follow.

Other than that, my question to you is similar to what johnswentworth said: to what degree does enhancing the capability of, say, OpenAI's LSTMs actually give alignment-relevant insight into how they work? There was always a trivial sense in which making AGI will teach us about AGI because when we finally turn it on we'll learn how we're doomed. The problem is that those innovations don't by default tend to lead to better understanding of behavior like this; on the contrary, they tend to make these systems even more complicated and inscrutable from a maths, security, and engineering perspective.

I've seen LW have a negative attitude towards capabilities research a lot longer, e.g.

lc, 2y

That's people criticizing OpenAI's particularly stupid stated philosophy of giving AI to everyone; what I haven't seen is people criticize capabilities gains or ML researchers in general.

Note also the comments of that post, where people mention that LW has been so negative towards the idea of capabilities progress that multiple ML researchers have been concerned about getting murdered by rationalists and have received literal death threats.

It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse since that world's equivalent of Redwood Research would have less time to do their research.

It seems to me that counterfactually changing the date of the start of the Deep Learning revolution has two impacts: it shortens or lengthens the Deep Learning era, and it accelerates or decelerates the arrival of AGI.

I.e., if you could have magically gotten Deep Learning to happen earlier, we would have had a longer time in the DL era, because there would be more time where people are using the DL paradigm while there was less compute to do it with, and more time for us to learn more about how Deep Learning works. But also, it means there are more researcher-hours going into finding DL techniques, which overall probably speeds up AGI arrival times.

It seems like a (the?) crux here is which of these impacts predominates. How much additional safety progress do you get from marginal knowledge of AI paradigms, vs. how much additional safety progress do you get from additional years to work on the problem?

Making up some numbers: would we prefer to have another 10 years to work on the problem, in which it is only in the final 2 that we get to see the paradigm in which AGI will be built? Or would we prefer to have 6 years to work on the problem, during all of which we have access to the paradigm that will build AGI?

Suppose you could build an AGI in 1999 or 2009, but the AGI required a specialized, expensive supercomputer to run, and there were only 1-2 such supercomputers in the world. Also suppose (for the sake of argument) that the AGI couldn't create a botnet of itself using PCs or conventional servers, or that creating such a botnet would not significantly improve the AGI's abilities (<2x improvement). Would that be a better outcome than an AGI that arrives in 2029 and can run on dozens or billions of machines which exist at that time?

Maybe? 

Not having a hardware overhang makes your planet much safer. But it depends on how quickly researchers would develop methods for scaling AGI systems, either by building more supercomputers, or generalizing our code to run on more conventional machines. If this process takes years or decades we get to experiment with AGI in a relatively safe way. But if this step takes months, then I think the world ends in ~ 2000 or ~ 2010 (depending on our AGI arrival date).

There might be a good argument for capability research being good if directed at making more Tool-AI instead of Agent-AI. In general I think there should be a push to redirect all research effort from Reinforcement Learning to things that are easier to use and control, like Language Models. And especially any system where the action space is manipulating the physical world should be made taboo. If the first AGI is a robotics system trained with RL and access to the physical world, we're significantly more screwed than if we just get a really really good Language model. Convincing capabilities researchers to switch to AI safety is hard, but just convincing them to focus on Tool-AI is a lot easier.

If the first AGI is a robotics system trained with RL and access to the physical world, we're significantly more screwed than if we just get a really really good Language model.

That doesn't seem true at all? A generally intelligent language model sounds like a manipulation machine, which sounds plenty dangerous.

A generally intelligent language model is one which outputs simulated human output which very closely resembles that in its dataset. The dataset of internet posts and books doesn't include very many examples of successfully manipulating teams of AI researchers, therefore that strategy is not assigned a high likelihood by the model, even if it might actually be capable of executing the strategy. A language model just outputs the continuation to the query and then stops. This would still be unsafe at ultra-high capabilities because of the risk of mesa-optimizers, but we can control a weakly superhuman language model by placing it in a box and resetting its state for every new question we ask it.

Also, detecting human manipulation is one of the things that we might believe human brains to be *exceptionally* good at. We didn't evolve to solve math or physics problems, but we certainly did evolve to deceive and detect deception in other humans. I expect that an AI with uniformly increasing capabilities across some set of tasks would become able to solve deep math problems much earlier than it would be able to manipulate hostile humans guarded against it.

This all means that a weakly superhuman language model would be a great tool to have, while still not ending the world right away.

In contrast, an open-ended reward maximizer that uses RL operating on the physical world is a nightmare: it would just automatically modify itself to acquire all the capabilities that the general language model would have, if it believed it needed them to maximise reward.

A generally intelligent language model is one which outputs simulated human output which very closely resembles that in its dataset.

What exactly makes it "general" then? What's the difference between a general language model and a non-general language model?

In some sense current language models are already general given their wide breadth. The real crucial part is being human-level or weakly superhuman; for instance, such a model should be able to generate a physics textbook, or generate correct science papers given only the abstract as a prompt. Novel scientific research is where I'd draw the line to define "impactful" language models.

Raemon, 2y

My sense is that Anthropic is somewhat oriented around this idea. I'm not sure if this is their actual plan or just some guesswork I read between the lines.

But I vaguely recall something like "develop capabilities that you don't publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities which you then have some lead time to try to inspect via interpretability techniques and then practice alignment on various capability-scales."

(I may have just made this up while trying to steelman them to myself)

Can explainability improve model accuracy? Our latest work shows the answer is yes!

 

Here is an excellent example of research that is both "capabilities research" and "alignment research".

I have an argument for capabilities research being good but with different assumptions. The assumption that's different is that we would progress rapidly towards AGI capabilities (say, in 10 years).

If we agree 95% of progress towards alignment happens very close to the AGI, then the duration of the interval between almost-AGI and AGI is the most important duration.

Suppose the ratio of capabilities research to alignment research is low (probably what most people here want). Then AI researchers and deployers will have the option to say "Look, so many resources were put towards safety already, it's actually fine, we're employing the 2027 comprehensive robustness benchmarks, and IDA+, in fact our quality assurance team is implementing it right now, no need to worry", prompting decision-makers to relax and let it go. Almost-AGI -> AGI interval is 2 years.

On the other hand, if it's high, this may cause decision-makers to freak out when they have their almost-AGI on the table and contain the development (e.g. with regulation). This may primarily be mediated via easier-to-avoid public failures and accidents. Or by AI safety people quickly and loudly demonstrating that we don't yet have the tools to avoid even these easier-to-avoid failures. Then regulation extends the Almost-AGI -> AGI interval to 8 years.

The point is that this is 4x more time to work on 95% of safety research progress.

I can see the argument of capabilities vs safety both ways. On the one hand, by working on capabilities, we may get some insights. We could figure out how much data is a factor, and what kinds of data are needed. We could figure out how long term planning emerges, and try our hand at inserting transparency into the model. We can figure out whether the system will need separate modules for world modeling vs reward modeling. On the other hand, if intelligence turns out to be not that hard, and all we need to do is train a giant decision transformer... then we have major problems.

I think it would be great to focus capabilities research into a narrower space as Razied says. My hunch is that a giant language model by itself would not go foom, because it's not really optimizing for anything other than predicting the next token. It's not even really aware of the passage of time. I can't imagine it having a drive to, for example, make the world output only a single word forever. I think the danger would be in trying to make it into an agent.

I also think that there must be alignment work that can be done without knowing the exact nature of the final product. For example, learning the human value function, whether it comes from a brain-like formulation, or inverse RL. I am also curious if there has been work done on trying to find a "least bad" nondegenerate value function, i.e. one that doesn't kill us, torture us, or tile the universe with junk, even if it does not necessarily want what we want perfectly. I think relevant safety work can always take the form of, "suppose current technology scaled up (e.g. decision transformer) could go foom, what should we do right now that could constrain it?" There is some risk that future advancements could be very different, and work done in this stage is not directly applicable, but I imagine it would still be useful somehow. Also, my intuition is that we could always wonder what's the next step in capabilities, until the final step, and we may not know it's the final step.

One thing you have to admit, though. Capabilities research is just plain exciting, probably on the same level as working on the Manhattan Project was exciting. I mean, who doesn't want to know how intelligence works?

I think the desire works because most honest people know that if they give a good-sounding answer that is ultimately meaningless, no benefits will come of the answers given. The people asking may eventually stop asking questions, knowing the answers are always useless. It's a matter of estimating future rewards from building relationships.

Now, when a human gives advice to another human, most of the time it is also useless, but not always. Also, it tends to not be straight up lies. Even in the useless case, people still think there is some utility in there, for example, having the person think of something novel, giving them a chance to vent without appearing to talk to a brick wall, etc.

To teach a GPT to do this, maybe there would have to be some reward signal. Whether it can be done purely with language modeling, I'm not sure. Maybe you could continue to train it with examples of its own responses and the interviewer's response afterwards, along with whether its advice was true or not. With enough of these sessions, perhaps you could run the language model and have it try to predict the human response, and see what it thinks of its own answers, haha.
