This is a special post for quick takes by Aaron_Scher. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
4 comments, sorted by Click to highlight new comments since:

Thinking about AI training runs scaling to the $100b/1T range. It seems really hard to do this as an independent AGI company (not owned by tech giants, governments, etc.). It seems difficult to raise that much money, especially if you're not bringing in substantial revenue or it's not predicted that you'll be making a bunch of money in the near future. 

What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. Were Anthropic founders pricing in that they're likely not going to be independent by the time they hit AGI — does this still justify the existence of a separate safety-oriented org?  

This is not a new idea, but I feel like I'm just now taking some of it seriously. Here's Dario talking about it recently

I basically do agree with you. I think it’s the intellectually honest thing to say that building the big, large scale models, the core foundation model engineering, it is getting more and more expensive. And anyone who wants to build one is going to need to find some way to finance it. And you’ve named most of the ways, right? You can be a large company. You can have some kind of partnership of various kinds with a large company. Or governments would be the other source.

Now, maybe the corporate partnerships can be structured so that AGI companies are still largely independent but, idk man, the more money invested the harder that seems to make happen. Insofar as I'm allocating probability mass between 'acquired by big tech company', 'partnership with big tech company', 'government partnership', and 'government control', acquired by big tech seems most likely, but predicting the future is hard. 

Slightly Aspirational AGI Safety research landscape 

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is. 

  • Interpretability / understanding model internals
    • Circuit interpretability
    • Superposition study
    • Activation engineering
    • Developmental interpretability
  • Understanding deep learning
    • Scaling laws / forecasting
    • Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
    • Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
    • Understanding normal but poorly understood things, like in context learning
    • Understanding weird phenomenon in deep learning, like this paper
    • Understand how various HHH fine-tuning techniques work
  • AI Control
    • General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
    • Unlearning
    • Steganography prevention / CoT faithfulness
    • Censorship study (how censoring AI models affects performance; and similar things)
  • Model organisms of misalignment
    • Demonstrations of deceptive alignment and sycophancy / reward hacking
    • Trojans
    • Alignment evaluations
    • Capability elicitation
  • Scaling / scalable oversight
    • RLHF / RLAIF
    • Debate, market making, imitative generalization, etc. 
    • Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
    • Weak to strong generalization
    • General ideas: factoring cognition, iterated distillation and amplification, recursive reward modeling 
  • Robustness
    • Anomaly detection 
    • Understanding distribution shifts and generalization
    • User jailbreaking
    • Adversarial attacks / training (generally), including latent adversarial training
  • AI Security 
    • Extracting info about models or their training data
    • Attacking LLM applications, self-replicating worms
  • Multi-agent safety
    • Understanding AI in conflict situations
    • Cascading failures
    • Understanding optimization in multi-agent situations
    • Attacks vs. defenses for various problems
  • Unsorted / grab bag
    • Watermarking and AI generation detection
    • Honesty (model says what it believes) 
    • Truthfulness (only say true things, aka accuracy improvement)
    • Uncertainty quantification / calibration
    • Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self play)

Don’t quite make the list: 

  • Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority. 
  • Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality. 

I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).

Quick thoughts on a database for pre-registering empirical AI safety experiments

Keywords to help others searching to see if this has been discussed: pre-register, negative results, null results, publication bias in AI alignment. 


The basic idea:

Many scientific fields are plagued with publication bias where researchers only write up and publish “positive results,” where they find a significant effect or their method works. We might want to avoid this happening in empirical AI safety. We would do this with a two fold approach: a venue that purposefully accepts negative and neutral results, and a pre-registration process for submitting research protocols ahead of time, ideally linked to the journal so that researchers can get a guarantee that their results will be published regardless of the result. 

Some potential upsides:

  • Could allow better coordination by giving researchers more information about what to focus on based on what has already been investigated. Hypothetically, this should speed up research by avoiding redundancy. 
  • Safely deploying AI systems may require complex forecasting of their behavior; while it would be intractable for a human to read and aggregate information across many thousands of studies, automated researchers may be able to consume and process information at this scale. Having access to negative results from minimal experiments may be helpful for this task. That’s a specific use case, but the general thing here is just that publication bias makes it hard to figure out what is true compared to if all results. 


Drawbacks and challenges:

  • Decent chance the quality of work is sufficiently poor so as to not be useful; we would need monitoring/review to avoid this. Specifically, whether a past project failed at a technique provides close to no evidence if you don’t think the project was executed competently, so you want a competence bar for accepting work. 
  • For the people who might be in a good position to review research, this may be a bad use of their time
  • Some AI safety work is going to be private and not captured by this registry. 
  • This registry could increase the prominence of info-hazardous research. Either that research is included in the registry, or it’s omitted. If it’s omitted, this could end up looking like really obvious holes in the research landscape, so a novice could find those research directions by filling in the gaps (effectively doing away with whatever security-through-obscurity info-hazardous research had going for it). That compares to the current state where there isn’t a clear research landscape with obvious holes, so I expect this argument proves too much in that it suggests clarity on the research landscape would be bad. Info-hazards are potentially an issue for pre-registration as well, as researchers shouldn’t be locked into publishing dangerous results (and failing to publish after pre-registration may give too much information to others). 
  • This registry going well requires considerable buy in from the relevant researchers; it’s a coordination problem and even the AI safety community seems to be getting eaten by the Moloch of publication bias
  • Could cause more correlated research bets due to anchoring on what others are working on or explicitly following up on their work. On the other hand, it might lead to less correlated research bets because we can avoid all trying the same bad idea first. 
  • It may be too costly to write up negative results, especially if they are being published in a venue that relevant researchers don’t regard highly. It may be too costly in terms of the individual researcher’s effort, but it could also be too costly even at a community level if the journal / pre-registry doesn’t end up providing much value

Overall, this doesn’t seem like a very good idea because of the costs and likelihood of success. There is plausibly a low cost version that would still get some of the benefit. Like higher-status researchers publicly advocating for publishing negative results, and others in the community discussing the benefits of doing so. Another low-cost solution would be small grants for researchers to write up negative results. 

Thanks to Isaac Dunn and Lucia Quirke for discussion / feedback during SERI MATS 4.0