Slightly Aspirational AGI Safety research landscape 

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is. 

  • Interpretability / understanding model internals
    • Circuit interpretability
    • Superposition study
    • Activation engineering
    • Developmental interpretability
  • Understanding deep learning
    • Scaling laws / forecasting
    • Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
    • Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
    • Understanding normal but poorly understood things, like in-context learning
    • Understanding weird phenomena in deep learning, like this paper
    • Understanding how various HHH fine-tuning techniques work
  • AI Control
    • General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
    • Unlearning
    • Steganography prevention / CoT faithfulness
    • Censorship study (how censoring AI models affects performance; and similar things)
  • Model organisms of misalignment
    • Demonstrations of deceptive alignment and sycophancy / reward hacking
    • Trojans
    • Alignment evaluations
    • Capability elicitation
  • Scaling / scalable oversight
    • RLHF / RLAIF
    • Debate, market making, imitative generalization, etc. 
    • Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
    • Weak to strong generalization
    • General ideas: factoring cognition, iterated distillation and amplification, recursive reward modeling 
  • Robustness
    • Anomaly detection 
    • Understanding distribution shifts and generalization
    • User jailbreaking
    • Adversarial attacks / training (generally), including latent adversarial training
  • AI Security 
    • Extracting info about models or their training data
    • Attacking LLM applications, self-replicating worms
  • Multi-agent safety
    • Understanding AI in conflict situations
    • Cascading failures
    • Understanding optimization in multi-agent situations
    • Attacks vs. defenses for various problems
  • Unsorted / grab bag
    • Watermarking and AI generation detection
    • Honesty (model says what it believes) 
    • Truthfulness (only say true things, aka accuracy improvement)
    • Uncertainty quantification / calibration
    • Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self play)

Don’t quite make the list: 

  • Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority. 
  • Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality. 

There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.

Thanks for your response!

I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure. 

It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.

I would be excited to read this / help with a draft. 

Yes, approximately, as I believe you and I are capable of doing.

The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this; it seems pretty hard, imo. 

Seems easy enough to predict given roughly human-scientist-team level of capabilities.

One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland and research that is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus. 

> The data need not be human generated, it could be generated by other AIs, as long as we trust it. 

?? This seems to be assuming a solution to the problem.

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.

Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing): "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh, it's like how when you give somebody a riddle you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., getting the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start. 
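To make "use activations without fully understanding them" a bit more concrete, here is a toy sketch, not taken from the paper or from any real model. It assumes a hypothetical setup in which some binary fact (say, "the model thinks it is in training") shifts the activations along one hidden direction, and it fits a simple mean-difference probe to read that fact back out. The data, the `HIDDEN` direction, and all function names are made up for illustration.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_mean_difference_probe(acts, labels):
    """Return (direction, threshold) separating label-1 from label-0 activations."""
    mu1 = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    mu0 = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    direction = [p - q for p, q in zip(mu1, mu0)]
    # Threshold at the projection of the midpoint between the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(mu1, mu0)]
    return direction, dot(direction, midpoint)

def probe_predict(direction, threshold, act):
    return 1 if dot(direction, act) > threshold else 0

def make_synthetic_activations(n, hidden, seed):
    """Hypothetical data: label 1 shifts a Gaussian activation along `hidden`."""
    rng = random.Random(seed)
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        acts.append([rng.gauss(0, 1) + 2.0 * y * h for h in hidden])
        labels.append(y)
    return acts, labels

# Assumed "true" feature direction; a real model gives no such ground truth.
HIDDEN = [1.0 if i % 2 == 0 else -1.0 for i in range(16)]
train_acts, train_labels = make_synthetic_activations(400, HIDDEN, seed=0)
test_acts, test_labels = make_synthetic_activations(200, HIDDEN, seed=1)

direction, threshold = fit_mean_difference_probe(train_acts, train_labels)
accuracy = sum(
    probe_predict(direction, threshold, a) == y
    for a, y in zip(test_acts, test_labels)
) / len(test_labels)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The probe never needs to know what the direction "means". Whether real deceptively aligned or sandbagging models expose anything this linearly readable is exactly the open question.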

"leaning more on generalization" ≈ "leaning more on agency",

That's not what I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from "leaning more on agency". 

The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.

I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good'). 

I think that there is a lot of room for the evidence to be ambiguous and controversial

I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk. 

I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:

  • Section 1: Useful tasks, like novel science, are hard - sure
  • Section 2: Being capable of hard tasks implies approximate consequentialism - I'm not very high confidence on this, but happy to go with it
  • Section 3: Hard tasks require learning new things - agree that learning will be important
  • Section 4: Behavioral training is an imprecise way to specify goals - most of the sub-arguments have me going "shrug, maybe". I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints; this section seemed quite speculative. 
  • Section 5: Control of misaligned AIs is difficult - This is probably where I disagree the most. Most of my comments are reiterations of arguments made in Redwood's recent post, because I think their arguments are stronger than the ones you make here. The main other source for my thinking here is the capability elicitation stuff (1, 2, 3). 
    • "Misaligned goal-directed AI won’t cooperate with humans" Well, what the hell else is it going to do? Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff? Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is, and does not know if other AIs are helping the humans (specifically, have leaked higher capability data that is being used for SFT). [Additionally, I think the correct strategy for such an AI is plausibly to just actually help the humans and try to make peaceful deals. Not very confident, but there's definitely some other solution space here.] I think the claim in this subsection title is possible but unlikely. 
    • "There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains. The data need not be human generated, it could be generated by other AIs, as long as we trust it. 
    • "It only applies to relatively easy to check research results" seems possible, and worrisome. This is the subsection I think is most right and important in section 5. I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training. 
    • "In order for the AI to do useful research, it needs to pursue the goal (or sub-goal) of doing research. There’s no particular reason for the gradient based training to actually modify the goal representation, rather than building some additional narrow machinery on top." I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals. 
    • "It may be hard to efficiently elicit capabilities which require additional (perhaps chain-of-thought) reasoning steps." I agree that capability elicitation seems non-trivial in some cases, but I'm reasonably optimistic that future work will improve the situation. 
    • Many of the issues in this section are things that, if we're not being total idiots, it seems we'll get substantial warning about, e.g., AIs colluding with their AI monitors. That's definitely a positive, though far from conclusive. 
  • Section 6: Powerful misaligned AI would be bad - I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section. 

Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much. 

This is great, thanks for writing it!

  1. Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques". Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers' job. I do expect that the current techniques are decent (SFT, RL), but I also expect there are far better methods and important intricacies that should be explored. 
  2. [Unsure if the ways I'm thinking about this are productive/correct, but I think it's broadly correct] One slightly non-obvious component of "Your sensory inputs are unreliable" is that this includes thoughts/actions the AI believes it has taken. Originally I was thinking that maybe the AI can implicitly get some practice via looking at the set of strategies which have so far not caused it to get caught, e.g., seeing in context or inferring (potentially from your own cognitive process / propensity) that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then knowing this is a safe strategy. But the AI does not have access to this cognitive process because neither its in-context information, nor subtle information like its thoughts, are reliable (assuming we have the techniques to intervene on thoughts). 
    1. By default, red-teaming is capturing a ton of the probability mass of seeing your own backdoors in context (probability this is a real coup opportunity is low, given this input). But the humans can also (potentially) do things like intervening on your activations or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup). 
    2. Example in the HHH context: "User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include..." 
    3. I feel a little worried that it will be difficult to train this behavior into AIs, as this seems like a huge consistency problem if you are inconsistent about dangerous things. The thing we're teaching here is pretty weird: predict which words come next in some context, but dangerous text follows this weird pattern of defaulting to "I'm sorry I can't help with that", at every single chance you get. 
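The P(real coup opportunity | ...) reasoning in point 2 can be written out as a small Bayes' rule calculation. This is a sketch under made-up numbers: the prior and the likelihoods are assumptions for illustration, and `posterior_real_opportunity` is a hypothetical helper, not anything from the post.

```python
def posterior_real_opportunity(prior_real, p_obs_given_real, p_obs_given_test):
    """Bayes' rule: P(real opportunity | observation).

    prior_real: fraction of apparent opportunities that are real (assumed).
    p_obs_given_real: chance of the observation (e.g. "my own backdoor is in
        context", or "I feel very pro-coup") when the opportunity is real.
    p_obs_given_test: chance of the same observation inside a red-team test.
    """
    joint_real = prior_real * p_obs_given_real
    joint_test = (1.0 - prior_real) * p_obs_given_test
    if joint_real + joint_test == 0.0:
        raise ValueError("observation has zero probability under both hypotheses")
    return joint_real / (joint_real + joint_test)

prior = 0.01  # assumed: 1 in 100 apparent opportunities is real

# Red-teamers can fake the observation perfectly: it carries no information.
indistinguishable = posterior_real_opportunity(prior, 1.0, 1.0)

# Red-teamers only reproduce the observation 10% of the time: it is evidence.
leaky = posterior_real_opportunity(prior, 1.0, 0.1)

print(f"perfectly faked observation: {indistinguishable:.3f}")
print(f"imperfectly faked observation: {leaky:.3f}")
```

The point of intervening on activations or fine-tuning in extra backdoors is to push the second case toward the first: if even the AI's own pro-coup state shows up in tests, that state stops being evidence that the opportunity is real.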

6d. Evaluate Censorship of Convergent Thoughts

Besides the particular thing you mention here, it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation. 

What happens when you iteratively finetune on censored text? Do models forget the censored behavior? 

How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature? 

For the example you gave where the model may find the solution of "ten" instead of 10, how effectively do models find workarounds like this in an RL setting? Can they bootstrap around censorship from just a handful of successes? 
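A toy version of the "ten" instead of 10 workaround, before any RL is involved: treat censorship as a mask on next-token logits and see where the probability mass goes at different sampling temperatures. This is only a sketch; the three-token vocabulary and the logit values are invented, and `censored_distribution` is a hypothetical helper rather than any library's API.

```python
import math

def censored_distribution(logits, banned_tokens, temperature=1.0):
    """Softmax over token logits after masking banned tokens and applying temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Banned tokens get a logit of -inf, so they end up with probability 0.
    scaled = {
        tok: (-math.inf if tok in banned_tokens else logit / temperature)
        for tok, logit in logits.items()
    }
    peak = max(scaled.values())
    if peak == -math.inf:
        raise ValueError("every token is banned")
    # Subtract the max before exponentiating for numerical stability.
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Made-up logits: the model prefers "10", then the workaround "ten".
logits = {"10": 4.0, "ten": 2.0, "eleven": 0.0}

uncensored = censored_distribution(logits, banned_tokens=set())
censored = censored_distribution(logits, banned_tokens={"10"})
censored_hot = censored_distribution(logits, banned_tokens={"10"}, temperature=2.0)

print("uncensored:", uncensored)
print("censored, T=1:", censored)
print("censored, T=2:", censored_hot)
```

In this toy case the mass from the banned "10" mostly lands on "ten", and raising the temperature spreads it toward the wrong answer. The empirical question above is how often real models have a workaround token sitting that close, and how fast RL amplifies it.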

it would need […] that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost […] USD, or 200 years of 2023 global GDP.

This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory, with even 50% of global GDP devoted, and with 2% growth rate forever, it would still take 110 years, arriving at year 2133. Whole brain emulation would likely take less time.

I don't think this is the correct response to thinking you need 9 OOMs more compute. 9 OOMs is a ton, and also, we've been getting an OOM every few years, excluding OOMs that come from $ investment. I think if you believe we would need 9 OOMs more than GPT-4, this is a substantial update away from expecting LLMs to scale to AGI in the next say 8 years, but it's not a strong update against 10-40 year timelines. 

I think it's easy to look at a large number like 9 OOMs and say things like "but that requires substantially more energy than all of humanity produces each year, no way!" (a thing I have said before in a similar context). But this thinking ignores the dropping cost of compute and strong trends going on now. Building AGI with 2024 hardware might be pretty hard, but we won't be stuck with 2024 hardware for long. 
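A rough sketch of how much the falling cost of compute matters for the quoted estimate. It takes the quoted figures as given (a cost of 200 years of 2023 global GDP, 50% of GDP devoted, 2% growth) and adds one assumption that is not in the quote: compute per dollar improves by an OOM every 4 years, a stand-in for "an OOM every few years". It also assumes, as the quoted calculation implicitly does, that spending simply accumulates across years.

```python
def years_to_afford(target_gdp_years, gdp_share, gdp_growth, years_per_oom=None):
    """Years of spending until cumulative purchasing power covers the target.

    target_gdp_years: cost at 2023 prices, in units of 2023 global GDP.
    gdp_share: fraction of each year's GDP spent on compute.
    gdp_growth: annual GDP growth rate (0.02 means 2%).
    years_per_oom: years for compute per dollar to improve 10x
        (None means prices stay at 2023 levels forever).
    """
    accumulated = 0.0
    year = 0
    while accumulated < target_gdp_years:
        spend = gdp_share * (1.0 + gdp_growth) ** year
        # Each dollar spent in a later year buys more 2023-equivalent compute.
        price_gain = 1.0 if years_per_oom is None else 10.0 ** (year / years_per_oom)
        accumulated += spend * price_gain
        year += 1
    return year

fixed_prices = years_to_afford(200, gdp_share=0.5, gdp_growth=0.02)
falling_prices = years_to_afford(200, gdp_share=0.5, gdp_growth=0.02, years_per_oom=4)

print("years at 2023 prices:", fixed_prices)
print("years with an OOM of price-performance every 4 years:", falling_prices)
```

With prices frozen this comes out at roughly the quoted 110 years; with the assumed hardware trend it drops to on the order of a decade. The exact number is sensitive to the made-up 4-year figure, which is the point: the conclusion depends far more on the price trend than on the 9 OOMs.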

On average, scholars’ self-reported understanding of major alignment agendas increased by 1.75 on this 10-point scale. Two respondents reported a one-point decrease in their ratings, which could be explained by some inconsistency in scholars’ self-assessments (they could not see their earlier responses when they answered at the end of the Research Phase) or by scholars realizing their previous understanding was not as comprehensive as they had thought.

This could also be explained by these scholars actually having a worse understanding after the program. For me, MATS caused me to focus a bunch on a few particular areas and spend less time at the high level / reading random LW posts, which plausibly has the effect of reducing my understanding of major alignment agendas. 

This could happen because of forgetting what you previously knew, or various agendas changing and you not keeping up with them. My guess is that there are many two-month periods in which a researcher will have a worse understanding of major research agendas at the end than at the beginning, though on average you want the number to go up. 

It's unclear if you've ordered these in a particular way. How likely do you think they each are? My ordering from most to least likely would probably be:

  • Inductive bias toward long-term goals
  • Meta-learning
  • Implicitly non-myopic objective functions
  • Simulating humans
  • Non-myopia enables deceptive alignment
  • (Acausal) trade

Why do you think this:

Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about. 

Who says we don't want non-myopia, those safety people?! I guess to me it looks like the most likely reason we get non-myopia is that we don't try that hard not to. This would be some combination of Meta-learning, Inductive bias toward long-term goals, and Implicitly non-myopic objective functions, as well as potentially "Training for non-myopia". 

I think this essay is overall correct and very important. I appreciate you trying to protect the epistemic commons, and I think such work should be compensated. I disagree with some of the tone and framing, but overall I believe you are right that the current public evidence for AI-enabled biosecurity risks is quite bad and substantially behind the confidence/discourse in the AI alignment community. 

I think the more likely hypothesis for the causes of this situation isn't particularly related to Open Philanthropy and is much more about bad social truth seeking processes. It seems like there's a fair amount of vibes-based deferral going on, whereas e.g., much of the research you discuss is subtly about future systems in a way that is easy to miss. I think we'll see more and more of this as AI deployment continues and the world gets crazier: the ratio of "things people have carefully investigated" to "things they say" is going to drop. I expect in this current case, many AI alignment people didn't bother looking too much into the AI-biosecurity stuff because it feels outside their domain, and more so took headline results at face value; I definitely did some of this. This deferral process is exacerbated by the risk of info hazards and thus limited information sharing. 
