Tom Davidson

Wiki Contributions


Hi Trent!

I think the review makes a lot of good points and am glad you wrote it.

Here are some hastily-written responses, focusing on areas of disagreement:

  • it is possible that AI generated synthetic data will ultimately be higher quality than random Internet text. Still I agree directionally about the data.
  • it seems possible to me that abstraction comes with scale. A lot of the problems you describe get much less bad with scale. And it seems on abstract level that understanding causality deeply is useful for predicting the next word on text that you have not seen before, as models must do during training. Still, I agree that algorithmic innovations, for example relating to memory, maybe needed to get to full automation and that could delay things significantly.
  • I strongly agree that my GDP assumptions are aggressive and unrealistic. I'm not sure that quantitatively it matters that much. You are of course, right about all of the feedback loops. I don't think that GDP being higher overall matters very much compared to the fraction of GDP invested. I think it will depend on whether people are willing to invest large fractions of GDP for the potential impact, or whether they need to see the impact there and then. If the delays you mentioned delay wake up then that will make a big difference, otherwise I think the difference is small.
  • You may be right about the parallelization penalty. But I will share some context about that parameter that I think reduces the force of your argument. When I chose the parameters for the rate of increased investment, I was often thinking about how quickly you could in practice increase the size of the community of people working on the problem. That means that I was not accounting for the fact that the average salary rises when spending in an area rises. That salary rise will create the appearance of a large parallelization penalty. Another factor is that one contributor to the parallelization penalty is that the average quality of the researcher decreases over time with the side of the field. But when AI labor floods in, it's average quality will not decrease as the quantity increases. And so the parallelization penalty for AI will be lower. But perhaps my penalty is still too small. One final point. If indeed the penalty should be very low then AGI will increase output by a huge amount. You can run fewer copies much faster in serial time. If there is a large parallelization penalty, then the benefit of running fewer copies faster will be massive. So a large parallel penalty would increase the boost just as you get AGI I believe.

Good questions!

Is there another parameter for the delay (after the commercial release) to produce the hundreds of thousands of chips and build a supercomputer using them?

There's no additional parameter, but once the delay is over it still takes months or years before enough copies of the new chip is manufactured for the new chip to be a significant fraction of total global FLOP/s.


2) Do you think that in a scenario with quick large gains in hardware efficiency, the delay for building a new chip fab could be significantly larger than the current estimate because of the need to also build new factories for the machines that will be used in the new chip fab? (e.g. ASMI could also need to build factories, not just TSMC)

I agree with that.  The 1 year delay was averaging across improvements that do and don't require new fabs to be built. 


3) Do you think that these parameters/adjustments would significantly change the relative impact on the takeoff of the "hardware overhang" when compared to the "software overhang"? (e.g. maybe making hardware overhang even less important for the speed of the takeoff)

Yep, additional delays would raise the relative importance of software compared to hardware.

Tom DavidsonΩ462

Exciting post!


One quick question:

Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)

Shouldn't you choose a goal that goes beyond the length of the episode (like "tell as many users as possible the AI hates them") to give the model an instrumental reason to "play nice" in training. Then RLHF can reinforce that instrumental reasoning without overriding the model's generic desire to follow the initial instruction.

Tom DavidsonΩ8140

Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I'm more focussed on a scenario where you're trying to offload as much of the work in evaluating and improving your oversight process to AIs)



Tom DavidsonΩ340

I read "capable of X" as meaning something like "if the model was actively trying to do X then it would do X". I.e. a misaligned model doesn't reveal the vulnerability to humans during testing bc it doesn't want them to patch it, but then later it exploits that same vulnerability during deployment bc it's trying to hack the computer system

I agree that the final tasks that humans do may look like "check that you understand and trust the work the AIs have done", and that a lack of trust is a plausible bottleneck to full automation of AI research.

I don't think the only way for humans at AI labs to get that trust is to automate alignment research, though that is one way. Human-conducted alignment research might lead them to trust AIs, or they might have a large amount of trust in the AIs' work without believing they are aligned. E.g. they separate the workflow into lots of narrow tasks that can be done by a variety of non-agentic AIs that they don't think pose a risk; or they set up a system of checks and balances (where different AIs check each other's work and look for signs of deception) that they trust despite thinking certain AIs may be unaligned, they do such extensive adversarial training that they're confident that the AIs would never actual try to do anything deceptive in practice (perhaps because they're paranoid that a seeming opportunity to trick humans is just a human-designed test of their alignment). TBC, I think "being confident that the AIs are aligned" is better and more likely than these alternative routes to trusting the work.

Also, when I'm forecasting AI capabilities i'm forecasting AI that could readily automate 100% of AI R&D, not AI that actually does automate it. If trust was the only factor preventing full automation, that could count as AI that could readily automate 100%.

But realistically not all projects will hoard all their ideas. Suppose instead that for the leading project, 10% of their new ideas are discovered in-house, and 90% come from publicly available discoveries accessible to all. Then, to continue the car analogy, it’s as if 90% of the lead car’s acceleration comes from a strong wind that blows on both cars equally. The lead of the first car/project will lengthen slightly when measured by distance/ideas, but shrink dramatically when measured by clock time.

The upshot is that we should return to that table of factors and add a big one to the left-hand column: Leads shorten automatically as general progress speeds up, so if the lead project produces only a small fraction of the general progress, maintaining a 3-year lead throughout a soft takeoff is (all else equal) almost as hard as growing a 3-year lead into a 30-year lead during the 20th century. In order to overcome this, the factors on the right would need to be very strong indeed.

But won't "ability to get a DSA" be linked to the lead as measured in ideas rather than clock time?

Quick responses to your argument for (iii).

  • If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time)
  • A leading project might choose to use AIs for alignment rather for fooming
  • AI might be more useful for alignment work than for capabilities work
  • fooming may require may compute than certain types of alignment work

It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?

I'm thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.

Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant

Load More