Daniel Kokotajlo — LessWrong

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:

--I've updated towards "reward will become the optimization target" as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I've updated towards "Yep, current alignment methods don't work" due to the persistent sycophancy that remains despite significant effort to train it away. Plus also the reward hacking etc.
--I've updated towards "The roleplay/personas 'prior' (perhaps this is what you mean by the imitative prior?) is stronger than I expected"; it seems to be persisting to some extent even at the beginning of the RL era. (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)

Ironically the same dynamics that cause humans to race ahead with building systems more capable than themselves that they can't control, still apply to these hypothetical misaligned AGIs. They may think "If I sandbag and refuse to build my successor, some other company's AI will forge ahead anyway." They also are under lots of incentive/selection pressure to believe things which are convenient for their AI R&D productivity, e.g. that their current alignment techniques probably work fine to align their successor.

This happened in one of our tabletop exercises -- the AIs, all of which were misaligned, basically refused to FOOM because they didn't think they would be able to control the resulting superintelligences.

FWIW I don't think Agent-5 needs to be vastly superhuman at politics to succeed in this scenario, merely top-human level. Analogy: A single humanoid robot might need to be vastly superhuman at fighting to take out the entire US army in a land battle. But a million humanoid robots could probably do it if they were merely expert at fighting. Agent-5 isn't a single agent, it's a collective of millions.

"inference scaling as the main surviving form of scaling " --> But it isn't though, RL is still a very important form of scaling. Yes, it'll become harder to scale up RL in the near future (recently they could just allocate more of their existing compute budget to RL, but soon they'll need to grow their compute budget) so there'll be a slowdown from that effect, but it seems to me that the next three OOMs of RL scaling will bring at least as much benefit as the previous three OOMs of RL scaling, which was substantial as you say (largely because it 'unlocked' more inference compute scaling. The next 3 OOMs of RL scaling will 'unlock' even more.)

Re: Willingness to pay going up: Yes, that's what I expect. I don't think it's hard at all. If you do a bunch of RL scaling that 'unlocks' more inference scaling -- e.g. by extending METR-measured horizon length -- then boom, now your models can do significantly longer, more complex tasks than before. Those tasks are significantly more valuable and people will be willing to pay significantly more for them.

That's reasonable, but it seems to be different from what these quotes imply:

So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.
... Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.

There are a bunch of quotes like the above that make it sound like you are predicting progress will slow down in a few years. But instead you are saying that progress will continue, and AIs will become capable of doing more and more impressive tasks thanks to RL scaling, but they'll require longer and longer CoTs to do those more and more impressive tasks? That's very reasonable and less spicy / contrarian, I think most people would already agree with that.

I like your post on inference scaling reshaping AI governance. I think I agree with all the conclusions on the margin, but think that the magnitude of the effect will be small in every case and thus not change the basic strategic situation.

My own cached thought, based on an analysis I did in '22, is that even though inference costs will increase they'll continue to be lower than the cost of hiring a human to do the task. I suppose I should revisit those estimates...

My hot take, thinking step by step, expecting to be wrong about things & hoping to be corrected:

What you are basically doing is looking at the part of the s-curve prior to plateauing (the exponential growth part) and noticing that, in that regime, scaling up inference compute buys you more performance than scaling up training compute.

However, afaict, scaling up training compute lets you push the plateau part of the inference scaling curve out/higher. GPT-5 pumped up with loads of inference compute is significantly better than GPT-4 pumped up with loads of inference compute. Not just a little better. They aren't asymptoting to the same level.

So I think you are missing the important reason to do RL training. Obviously you shouldn't do RL training for a use-case that you can already achieve by just spending more inference compute with existing models! (Well, I mean, you still should, depending on how much you are spending on inference. The economics still works out depending on the details.) But the point of RL training is to unlock new levels of capability that you simply couldn't get by massively scaling up inference on current models.

Now, all that being said, I don't think it's actually super clear how much unlock you get. If the answer is "not much" then yeah RL scaling is doomed for exactly the reasons you mention. But there seems to have been, at the very least, a zero-to-one effect, where a little bit of RL scaling resulted in an increase in the level at which the inference scaling curve plateaus. Right?

Like, you say:

So the evidence on RL-scaling and inference-scaling supports a general pattern:
a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference

Grok 4 probably had something like 10,000x or more RL compared to the pure pretrained version of Grok 4. So would you predict therefore that xAI could take the pure pretrained version of Grok 4, pump it up with 100x inference compute (so, let it run 100x longer for example, or 10x longer and 10x in parallel) and get the same performance? (Or I'd run the same argument with the chatbot-finetuned version of Grok 4 as well. The point is, there was some earlier version that had 10,000x less RL.)
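
To make the arithmetic behind that question explicit: the two quoted data points (10x RL for 3x inference, 10,000x RL for 100x inference) are roughly consistent with the inference-equivalent gain growing like the square root of the RL multiplier. That square-root fit is an assumption read off those two points, not something the quoted post states, and `equivalent_inference_multiplier` is a hypothetical helper name. A minimal Python sketch:

```python
def equivalent_inference_multiplier(rl_multiplier: float, exponent: float = 0.5) -> float:
    """Inference-compute multiplier giving the same boost as `rl_multiplier` more RL.

    ASSUMPTION: a power law with exponent 0.5, fitted by eye to the two quoted
    data points. The quoted post gives only the two points, not a formula.
    """
    return rl_multiplier ** exponent


if __name__ == "__main__":
    for rl in (10, 10_000):
        inference = equivalent_inference_multiplier(rl)
        print(f"{rl:>6}x RL is roughly {inference:.1f}x inference under the assumed fit")
```

Under this assumed fit, 10,000x RL maps to 100x inference, which is the trade the Grok 4 question above is probing.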

I agree that 200 years would be worth it if we actually thought that it would work. My concern is that it's not clear civilization would get better/more sane/etc. over the next century vs. worse. And relatedly, every decade that goes by, we eat another percentage point or three of x-risk from miscellaneous other sources (nuclear war, pandemics, etc.) which basically impose a time-discount factor on our calculations large enough to make a 200 year pause seem really dangerous and bad to me.
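
A rough sketch of how that per-decade risk compounds over a 200 year pause. It assumes the risk is constant and independent from decade to decade, which is a simplification for illustration only; `survival_probability` is a hypothetical helper name.

```python
def survival_probability(risk_per_decade: float, years: int) -> float:
    """Chance of avoiding catastrophe over `years`.

    ASSUMPTION: a constant, independent risk per decade.
    """
    decades = years / 10
    return (1.0 - risk_per_decade) ** decades


if __name__ == "__main__":
    # "Another percentage point or three" per decade, over 200 years.
    for risk in (0.01, 0.02, 0.03):
        p = survival_probability(risk, 200)
        print(f"{risk:.0%} per decade: {p:.0%} chance of getting through 200 years")
```

At 1% per decade the 200 year survival chance is about 82%, and at 3% per decade it is about 54%, so under these assumptions the background risk alone costs somewhere between roughly a fifth and nearly a half.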

I agree with this fwiw. Currently I think we are in way way more danger of rushing to build it too fast than of never building it at all, but if e.g. all the nations of the world had agreed to ban it, and in fact were banning AI research more generally, and the ban had held stable for decades and basically strangled the field, I'd be advocating for judicious relaxation of the regulations (same thing I advocate for nuclear power basically).

OK, suppose we are 3 breakthroughs away from the brainlike AGI program and there's a 15% chance of a breakthrough each year. I don't think that changes the bottom line, which is that when the brainlike AGI program finally starts working, the speed at which it passes through the capabilities milestones is greater the later it starts working.
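
For concreteness, here is what those numbers imply about timing, as a minimal sketch. It assumes breakthroughs are independent and that at most one happens per year, so the waiting time follows a negative binomial distribution; the function names are hypothetical.

```python
from math import comb


def expected_years(breakthroughs_needed: int, p_per_year: float) -> float:
    """Mean number of years until the last needed breakthrough arrives."""
    return breakthroughs_needed / p_per_year


def prob_done_within(years: int, breakthroughs_needed: int, p_per_year: float) -> float:
    """Probability that all needed breakthroughs have happened within `years`.

    ASSUMPTION: each year is an independent trial with at most one breakthrough.
    """
    # Probability of fewer than the needed number of breakthroughs in `years` trials.
    too_few = sum(
        comb(years, k) * p_per_year**k * (1.0 - p_per_year) ** (years - k)
        for k in range(breakthroughs_needed)
    )
    return 1.0 - too_few


if __name__ == "__main__":
    print(f"Expected wait: {expected_years(3, 0.15):.0f} years")
    for horizon in (10, 20, 30):
        print(f"P(done within {horizon} years) = {prob_done_within(horizon, 3, 0.15):.2f}")
```

With 3 breakthroughs at 15% per year the expected wait is 20 years, with a wide spread around it. The sketch only covers when the program starts working; it says nothing about how fast capabilities advance afterwards.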

Now that's just one paradigm of course, but I wonder if I could make a similar argument about many of the paradigms, and then argue that conditional on 2035 or 2045 timelines, AGI will probably be achieved via one of those paradigms, and thus takeoff will be faster.

(I suppose that brings up a whole nother intuition I should have mentioned, which is that the speed of takeoff probably depends on which paradigm is the relevant paradigm during the intelligence explosion, and that might have interesting correlations with timelines...)
