Hauke Hillebrandt


AI Competition

Wiki Contributions


Hanson Strawmans the AI-Ruin Argument


I don't agree with Hanson generally, but I think there's something there that rationalist AI risk public outreach has overemphasized first principles thinking, theory, and logical possibilities (e.g. evolution, gradient decent, human-chimp analogy, ) over concrete more tangible empirical findings (e.g. deception emerging in small models, specification gaming, LLMs helping to create WMDs, etc.).

AI labs should escalate the frequency of tests for how capable their model is as they increase compute during training

Comments on the doc welcome.

Inspired by ideas from Lucius Bushnaq, David Manheim, Gavin Leech, but any errors are mine.


AI experts almost unanimously agree that AGI labs should pause the development process if sufficiently dangerous capabilities are detected. Compute, algorithms, and data, form the AI triad—the main inputs to produce better AI. AI models work by using compute to run algorithms that learn from data. AI progresses due to more compute, which doubles every 6 months; more data, which doubles every 15 months; and better algorithms, which half the need for compute every 9 months and data every 2 years.

And so, better AI algorithms and software are key to AI progress (they also increase the effective compute of all chips, whereas improving chip design only improves new chips.)

While so far, training the AI models like GPT-4 only costs ~$100M, most of the cost comes from running them as evidenced by OpenAI charging their millions of users $20/month with a cap on usage, which costs ~1 cent / 100 words.  

And so, AI firms could train models with much more compute now and might develop dangerous capabilities.

We can more precisely measure and predict in advance how much compute we use to train a model in FLOPs. Compute is also more invariant vis-a-vis how much it will improve AI than are algorithms or data. We might be more surprised by how much effective compute we get from better / more data or better algorithms, software, RLHF, fine-tuning, or functionality (cf DeepLearning, transformers, etc.). AI firms increasingly guard their IP and by 2024, we will run out of public high-quality text data to improve AI. And so, AI firms like DeepMind will be at the frontier of developing the most capable AI. 

To avoid discontinuous jumps in AI capabilities, they must never train AI with better algorithms, software, functionality, or data with a similar amount of compute than what we used previously; rather, they should use much less compute first, pause the training, and compare how much better the model got in terms of loss and capabilities compared to the previous frontier model. 

Say we train a model using better data using much less compute than we used for the last training run. If the model is surprisingly better during a pause and evaluation at an earlier stage than the previous frontier model trained with a worse algorithm at an earlier stage, it means there will be discontinuous jumps in capabilities ahead, and we must stop the training. A software to this should be freely available to warn anyone training AI, as well as implemented server-side cryptographically so that researchers don't have to worry about their IP, and policymakers should force everyone to implement it.

There are two kinds of performance/capabilities metrics:

  1. Upstream info-theoretic: Perplexity / cross entropy / bits-per-character. Cheap. 
  2. Downstream noisy measures of actual capabilities: like MMLU, ARC, SuperGLUE, Big Bench. Costly.

AGI labs might already measure upstream capabilities as it is cheap to measure. But so far, no one is running downstream capability tests mid-training run, and we should subsidize and enforce such tests. Researchers should formalize and algorithmitize these tests and show how reliably they can be proxied with upstream measures. They should also develop a bootstrapping protocol analogous to ALBA, which has the current frontier LLM evaluate the downstream capabilities of a new model during training. 

Of course, if you look at deep double descent ('Where Bigger Models and More Data Hurt'), inverse scaling laws, etc., capabilities emerge far later in the training process. Looking at graphs of performance / loss over the training period, one might not know until halfway through (the eventually decided cutoff for training, which might itself be decided during the process,) that it's doing much better than previous approaches- and it could look worse early on. Cross-entropy loss improves even for small models, while downstream metrics remain poor. This suggests that downstream metrics can mask improvements in log-likelihood. This analysis doesn't explain why downstream metrics emerge or how to predict when they will occur. More research is needed to understand how scale unlocks emergent abilities and to predict. Moreover, some argue that emergent behavior is independent of how granular a downstreams evaluation metrics is (e.g. if it uses an exact string match instead of another evaluation metric that awards partial credit), these results were only tested every order of magnitude FLOPs.

And so, during training, as we increase the compute used, we must escalate the frequency of automated checks as the model approaches the performance of the previous frontier models (e.g. exponentially shorten the testing intervals after 10^22 FLOPs). We must automatically stop the training well before the model is predicted to reach the capabilities of the previous frontier model, so that we do not far surpass it. Alternatively, one could autostop training when it seems on track to reach the level of ability / accuracy of the previous models, to evaluate what the trajectory at that point looks like.

Figure from: 'Adjacent plots for error rate and cross-entropy loss on three emergent generative tasks in BIG-Bench for LaMDA. We show error rate for both greedy decoding (T = 0) as well as random sampling (T = 1). Error rate is (1- exact match score) for modified arithmetic and word unscramble, and (1- BLEU score) for IPA transliterate.'

Figure from: 'Adjacent plots for error rate, cross-entropy loss, and log probabilities of correct and incorrect responses on three classification tasks on BIG-Bench that we consider to demonstrate emergent abilities. Logical arguments only has 32 samples, which may contribute to noise. Error rate is (1- accuracy).'

ARC's GPT-4 evaluation is cited in the FT article, in case that was ambiguous.

Agreed, the initial announcement read like AI safety washing and more political action is needed, hence the call to action to improve this.

But read the taskforce leader’s op-ed

  1. He signed the pause AI petition.
  2. He cites ARC’s GPT-4 evaluation and Lesswrong in his AI report which has a large section on safety.
  3. “[Anthropic] has invested substantially in alignment, with 42 per cent of its team working on that area in 2021. But ultimately it is locked in the same race. For that reason, I would support significant regulation by governments and a practical plan to transform these companies into a Cern-like organisation. We are not powerless to slow down this race. If you work in government, hold hearings and ask AI leaders, under oath, about their timelines for developing God-like AGI. Ask for a complete record of the security issues they have discovered when testing current models. Ask for evidence that they understand how these systems work and their confidence in achieving alignment. Invite independent experts to the hearings to cross-examine these labs. [...] Until now, humans have remained a necessary part of the learning process that characterises progress in AI. At some point, someone will figure out how to cut us out of the loop, creating a God-like AI capable of infinite self-improvement. By then, it may be too late.”

Also the PM just tweeted about AI safety

Generally, this development seems more robustly good and the path to a big policy win for AI safety seems clearer here than past efforts trying to control US AGI firms optimizing for profit. Timing also seems much better as things looks way more ‘on’ now.  And again, even if the EV sign of the taskforce flips, then $125M is .5% of the $21B invested in AGI firms this year.

Are you saying that, as a rule, ~EAs should stay clear of policy for fear of tacit endorsement, which has caused harm and made damage control much harder and we suffer from cluelessness/clumsiness? Yes, ~EA involvement has in the past sometimes been bad, accelerated AI, and people got involved to get power for later leverage or damage control (cf. OpenAI), with uncertain outcomes (though not sure it’s all robustly bad - e.g. some say that RLHF was pretty overdetermined). 

I agree though that ~EA policy pushing for mild accelerationism vs. harmful actors is less robust (cf. the CHIPs Act, which I heard a wonk call the most aggressive US foreign policy in 20 years), so would love to hear your more fleshed out push back on this - I remember reading somewhere recently that you’ve also had a major rethink recently vis-a-vis unintended consequences from EA work?

Ian Hogarth is leading the task force who's on record saying that AGI could lead to “obsolescence or destruction of the human race” if there’s no regulation on the technology’s progress. 

Matt Clifford is also advising the task force - on record having said the same thing and knows a lot about AI safety. He had Jess Whittlestone & Jack Clark on his podcast. 

If mainstream AI safety is useful and doesn't increase capabilities, then the taskforce and the $125M seem valuable.

If it improves capabilities, then it's a drop in the bucket in terms of overall investment going into AI.

a large part of those 'leaks' are fake


Can you give concrete examples?

[Years of life lost due to C19]

A recent meta-analysis looks at C-19-related mortality by age groups in Europe and finds the following age distribution:

< 40: 0.1%

40-69: 12.8%

≥ 70: 84.8%

In this spreadsheet model I combine this data with Metaculus predictions to get at the years of life lost (YLLs) due to C19.

I find C19 might cause 6m - 87m YYLs (highly dependending on # of deaths). For comparison, substance abuse causes 13m, diarrhea causes 85m YLLs.

Countries often spend 1-3x GDP per capita to avert a DALY, and so the world might want to spend $2-8trn to avert C19 YYLs (could also be a rough proxy for the cost of C19).

One of the many simplifying assumptions of this model is that excludes disability caused by C19 - which might be severe.

Very good analysis.

I also thought your recent blog was excellent and think you should make it a top level post:


Cruise Ship passenger are a non random sample with perhaps higher co-morbidities.

The cruise ships analysed are non-random sample: "at least 25 other cruise ships have confirmed COVID-19 cases"

Being on a cruise ship might increase your risk because of dose response https://twitter.com/robinhanson/status/1242655704663691264

Onboard IFR. as 1.2% (0.38-2.7%) https://www.medrxiv.org/content/10.1101/2020.03.05.20031773v2

Ioannidis: “A whole country is not a ship.”

Load More