jsd - LessWrong

jsd10d44

Well that was timely

Scaling of AI training runs will slow down after GPT-5

jsd3mo96

Amazon recently bought a 960MW nuclear-powered datacenter.

I think this doesn't contradict your claim that "The largest seems to consume 150 MW" because the 960MW datacenter hasn't been built (or there is already a datacenter there but it doesn't consume that much energy for now)?

The Best Tacit Knowledge Videos on Every Subject

jsd4mo30

Related: Film Study for Research

The Best Tacit Knowledge Videos on Every Subject

jsd4mo20

Domain: Mathematics

Link: vEnhance

Person: Evan Chen

Background: math PhD student, math olympiad coach

Why: Livestreams himself thinking about olympiad problems

The Best Tacit Knowledge Videos on Every Subject

jsd4mo20

Domain: Mathematics

Link: Thinking about math problems in real time

Person: Tim Gowers

Background: Fields medallist

Why: Livestreams himself thinking about math problems

Scenario Forecasting Workshop: Materials and Learnings

jsd5mo10

From the Rough Notes section of Ajeya's shared scenario:

Meta and Microsoft ordered 150K GPUs each, big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)

Just to check my understanding, here's my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs). ETA: see clarification in Eli's reply.

I'm curious if we have info about the floating point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?

Some heuristics I use for deciding how much I trust scientific results

jsd6mo10

Thanks, I think this is a useful post, I also use these heuristics.

I recommend Andrew Gelman’s blog as a source of other heuristics. For example, the Piranha problem and some of the entries in his handy statistical lexicon.

jsd's Shortform

jsd6mo10

Mostly I care about this because if there's a small number of instances that are trying to take over, but a lot of equally powerful instances that are trying to help you, this makes a big difference. My best guess is that we'll be in roughly this situation for "near-human-level" systems.

I don't think I've seen any research about cross-instance similarity

I think mode-collapse (update) is sort of an example.

How would you say humanity does on this distinction? When we talk about planning and goals, how often are we talking about "all humans", vs "representative instances"?

It's not obvious how to make the analogy with humanity work in this case - maybe comparing the behavior of clones of the same person put in different situations?

jsd's Shortform

jsd6mo10

I'm not even sure what it would mean for a non-instantiated model without input to do anything.

For goal-directedness, I'd interpret it as "all instances are goal-directed and share the same goal".

As an example, I wish Without specific countermeasures had made the distinction more explicit.

More generally, when discussing whether a model is scheming, I think it's useful to keep in mind worlds where some instances of the model scheme while others don't.

jsd's Shortform

jsd6mo10

When talking about AI risk from LLM-like models, when using the word "AI" please make it clear whether you are referring to:

A model
An instance of a model, given a prompt

For example, there's a big difference between claiming that a model is goal-directed and claiming that a particular instance of a model given a prompt is goal-directed.

I think this distinction is obvious and important but too rarely made explicit.

LESSWRONG
LW

Posts

Wiki Contributions

Comments