paulfchristiano

Comments

I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines by 1-2 weeks, depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs. endogenous, diminishing returns curves, the importance of human capital, etc. If you mean +2-5% investment in a single year, then I would guess the impact is < 1 week.
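For concreteness, here is a crude back-of-envelope for that conversion (a sketch only; the growth rates are assumptions for illustration, and it ignores the dynamics listed above, so it tends to land on the high side):

```python
import math

def timeline_shift_weeks(investment_bump, annual_growth):
    """Toy model: treat a one-off fractional bump in the level of AI investment
    as moving along an exponential investment growth curve by
    log(1 + bump) / log(1 + growth) years, then convert to weeks."""
    return 52 * math.log(1 + investment_bump) / math.log(1 + annual_growth)

# Illustrative numbers only (not from the comment): +2-5% bump, 30-100% annual growth.
for bump in (0.02, 0.05):
    for growth in (0.30, 1.00):
        print(f"bump {bump:.0%}, growth {growth:.0%}: "
              f"~{timeline_shift_weeks(bump, growth):.1f} weeks")
```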

I haven't thought about it much, but my all-things-considered estimate of the expected timelines slowdown, if you just hadn't done the ChatGPT release, is probably 1-4 weeks.

Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?

One thing worth pointing out in defense of your original estimate is that variances, not effect sizes, are what should add up to 100%. So e.g. if the standard deviation is $100B, then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for ±$10B effect sizes after the fact).
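Spelling out that arithmetic: for roughly independent contributions, variances add while effect sizes add only in quadrature, so

$$\sigma_{\text{total}}^2=\sum_{i=1}^{100}\sigma_i^2 \quad\Longrightarrow\quad (\$100\text{B})^2 = 100\times(\$10\text{B})^2,$$

i.e. 100 independent factors can each account for a ±$10B swing while jointly explaining the full $100B standard deviation in total investment.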

Yes, I mean that those measurements don't really speak directly to the question of whether you'd be safer using RLHF or imitation learning.

I agree that safety people have lots of ideas more interesting than "stack more layers," but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)

Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities, I think it's almost entirely by correcting that inefficiency faster than might have happened otherwise, especially via OpenAI's training of GPT-3. But this wasn't a case of safety people incidentally benefiting capabilities as a byproduct of their work; it was a case of some people who care about safety deliberately doing something they thought would be a big capabilities advance. I think those are much more plausible as a source of acceleration!

(I would describe RLHF as pretty prototypical: "Don't be clever, just stack layers and optimize the thing you care about." I feel like people on LW are being overly mystical about it.)

I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low-loss actions instrumentally and thereby be selected by gradient descent.

Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.

I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward-hacking strategy to get a lot of reward.

I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.

I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

I don't think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other, bigger impacts of RLHF, both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.

But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn't quite make sense to me.

You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.

You don't know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
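As a minimal schematic of the two objectives being contrasted here (a sketch, not either model's actual training code; `model`, `reward_model`, and the `log_prob` helper are assumed placeholders):

```python
import torch
import torch.nn.functional as F

def imitation_loss(model, tokens):
    """Supervised / pretraining objective: log loss (cross-entropy) on the next
    token, which is what the imitation-trained model is optimized for."""
    logits = model(tokens[:, :-1])                 # predict each next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def rlhf_loss(model, prompts, responses, reward_model):
    """RLHF objective (REINFORCE-style sketch): increase the log-probability of
    sampled responses in proportion to a learned reward estimate, i.e. an
    (ideally unbiased) estimator of reward. Baselines and KL penalties omitted."""
    logprobs = model.log_prob(responses, prompts)  # placeholder helper (assumed)
    rewards = reward_model(prompts, responses)     # learned reward-model scores
    return -(rewards.detach() * logprobs).mean()   # gradient ascent on E[reward]
```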

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason you are working on it. And at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chatbot attempts.

I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.

I think the biggest problem with previous chatbot attempts was that the underlying models were way, way weaker than GPT-3.5.

I don't think so, and I have been trying to be quite careful about this. ChatGPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn't that crazy. The product has a user base not that different from other $10B products, and a growth rate that puts basically all of them to shame, so I don't think a $10B effect from ChatGPT seems that unreasonable. There is only so much variance to go around, but ChatGPT is absolutely massive in its impact.

This still seems totally unreasonable to me:

  • How much total investment do you think there is in AI in 2023?
  • How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)
  • How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs?

I think it's unlikely that the reception of ChatGPT increased OpenAI's valuation by $10B, much less investment in OpenAI, even before thinking about replaceability. I think that Codex, GPT-4, DALL-E, etc. are all very major parts of the valuation.

I also think replaceability is a huge correction term here. I think it would be more reasonable to talk about moving how many dollars of investment how far forward in time.

I find a comparison with John Schulman here unimpressive if you want to argue that progress on this was overdetermined, given John's safety motivation, and given my best guess that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, John would indeed not have worked on it.

I think John wants to make useful stuff, so I doubt this.

But RLHF ruined it

I'm not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it's probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won't do other things.

Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten away with a world where we used AI inefficiently because we didn't think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF), that was never a remotely plausible outcome.

ETA: also, just to be clear, I think that this (the fictional strategy of developing GPT so that future AIs won't be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," but by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

This could be helpful for "advertising" reasons, but I think my sense of how much this actually helps with the real alignment problem correlates pretty strongly with how similar A is, in terms of how it got its capabilities, to future lethal systems. What are the ways that the helpfulness of an observational study like this for alignment can come apart from similarity in how the capabilities were generated?

There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases.

It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect it to be produced by just using gradient descent, resulting in minds that are in some ways different and in many ways similar. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI."

(The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)
