Answer by Aaron_Scher · Apr 19, 2024 · 20

Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me: 

  • Scaling curves show strongly diminishing returns to dollars spent: a $100m model might not be that far behind a $1b model, performance-wise. 
  • There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, and they hurt your closed-source competitors. 
  • I expect there is still tons of low-hanging fruit available in LLM capabilities land. You could call this "algorithmic progress" if you want. This will decrease the compute cost necessary to get a given level of performance, thus raising the AI capability level accessible to less-resourced open-source AI projects. 

The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL is not a new form of learning that has not been seen before.

I'm not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can some of the hypotheses not be unknown? E.g., the model's no-context priors are that it's doing Wikipedia prediction (50%), chat bot roleplay (40%), or some unknown role (10%). ICL seems like it could increase the weight on the unknown role. Meanwhile, actually figuring out how to do a good job in the previously-unknown role would require piecing together other knowledge the model has, and sufficiently strong building blocks would allow a lot of learning of new concepts. 
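
A minimal sketch of this picture, treating ICL as a Bayesian update over the three roles. The priors are the ones given above; the likelihood values are made-up numbers purely for illustration, and `posterior` is a hypothetical helper, not anything taken from the post being discussed.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a discrete hypothesis set: P(h | ctx) is proportional to P(ctx | h) * P(h)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

# No-context priors from the comment above.
priors = {"wikipedia": 0.5, "chatbot": 0.4, "unknown": 0.1}

# Hypothetical likelihoods: how probable the observed context is under each role.
# A context that fits neither known role is assumed to be far likelier under "unknown".
likelihoods = {"wikipedia": 0.01, "chatbot": 0.02, "unknown": 0.30}

post = posterior(priors, likelihoods)
for role, weight in post.items():
    print(f"{role}: {weight:.3f}")
```

With these assumed numbers most of the posterior mass moves onto the unknown role. The update only says the role is unfamiliar; doing well in it still depends on the building blocks the model already has.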

For example, if the GPT-4 evaluator gave a weighted score of  to a summary generated by Claude 2 and a weighted score of  to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be .

Should this be 3/(2+3) = 0.6? Not sure I've understood correctly.
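
For reference, the arithmetic behind the question as a minimal sketch. The scores 2 and 3 come from the expression above; which summary received which score is an assumption here (2 for the Claude summary, 3 for GPT-4's own), and `normalized_scores` is a hypothetical helper rather than the paper's code.

```python
def normalized_scores(score_other, score_self):
    """Divide each weighted score by the sum of the pair so the two shares add to 1."""
    total = score_other + score_self
    return score_other / total, score_self / total

# Assumed assignment: Claude summary scored 2, the evaluator's own summary scored 3.
claude_share, self_share = normalized_scores(2, 3)
print(claude_share, self_share)
```

Under that assignment the Claude summary gets 2/5 = 0.4 and the evaluator's own summary gets 3/5 = 0.6, so the two readings differ only in which summary the quoted number is attached to.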

I expect a lot more open releases this year and am committed to test their capabilities and safety guardrails rigorously.

Glad you're planning on continual testing. That seems particularly important here, where the default is that every once in a while some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have found, and will continue to find, ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc. 

Slightly Aspirational AGI Safety research landscape 

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is. 

  • Interpretability / understanding model internals
    • Circuit interpretability
    • Superposition study
    • Activation engineering
    • Developmental interpretability
  • Understanding deep learning
    • Scaling laws / forecasting
    • Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
    • Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
    • Understanding normal but poorly understood things, like in-context learning
    • Understanding weird phenomena in deep learning, like this paper
    • Understanding how various HHH fine-tuning techniques work
  • AI Control
    • General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
    • Unlearning
    • Steganography prevention / CoT faithfulness
    • Censorship study (how censoring AI models affects performance; and similar things)
  • Model organisms of misalignment
    • Demonstrations of deceptive alignment and sycophancy / reward hacking
    • Trojans
    • Alignment evaluations
    • Capability elicitation
  • Scaling / scalable oversight
    • RLHF / RLAIF
    • Debate, market making, imitative generalization, etc. 
    • Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
    • Weak to strong generalization
    • General ideas: factoring cognition, iterated distillation and amplification, recursive reward modeling 
  • Robustness
    • Anomaly detection 
    • Understanding distribution shifts and generalization
    • User jailbreaking
    • Adversarial attacks / training (generally), including latent adversarial training
  • AI Security 
    • Extracting info about models or their training data
    • Attacking LLM applications, self-replicating worms
  • Multi-agent safety
    • Understanding AI in conflict situations
    • Cascading failures
    • Understanding optimization in multi-agent situations
    • Attacks vs. defenses for various problems
  • Unsorted / grab bag
    • Watermarking and AI generation detection
    • Honesty (model says what it believes) 
    • Truthfulness (only say true things, aka accuracy improvement)
    • Uncertainty quantification / calibration
    • Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self play)

Don’t quite make the list: 

  • Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority. 
  • Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality. 

There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.

Thanks for your response!

I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure. 

It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.

I would be excited to read this / help with a draft. 

Yes, approximately, as I believe you and I are capable of doing.

The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo. 

Seems easy enough to predict given roughly human-scientist-team level of capabilities.

One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland and research that is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus. 

> The data need not be human generated, it could be generated by other AIs, as long as we trust it. 

?? This seems to be assuming a solution to the problem.

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.

Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing), "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh, it's like how when you give somebody a riddle you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., getting the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start. 

"leaning more on generalization" ≈ "leaning more on agency",

That's not what I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from "leaning more on agency". 

The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.

I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good'). 

I think that there is a lot of room for the evidence to be ambiguous and controversial

I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk. 

I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:

  • Section 1: Useful tasks, like novel science, are hard - sure
  • Section 2: Being capable of hard tasks implies approximate consequentialism - I'm not very high confidence on this, but happy to go with it
  • Section 3: Hard tasks require learning new things - agree that learning will be important
  • Section 4: Behavioral training is an imprecise way to specify goals - most of the sub-arguments have me going "shrug, maybe". I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints; this section seemed quite speculative. 
  • Section 5: Control of misaligned AIs is difficult - This is probably where I disagree the most. Most of my comments are reiterations of arguments made in Redwood's recent post, because I think their arguments are stronger than the ones you make here. The main other source for my thinking here is the capability elicitation stuff (1, 2, 3). 
    • "Misaligned goal-directed AI won’t cooperate with humans" Well, what the hell else is it going to do? Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff? Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is, and does not know if other AIs are helping the humans (specifically, have leaked higher capability data that is being used for SFT). [Additionally, I think plausibly the correct strategy for such an AI is to just actually help the humans and try to make peaceful deals. Not very confident, but there's definitely some other solution space here.] I think the claim in this subsection title is possible but unlikely. 
    • "There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains. The data need not be human generated, it could be generated by other AIs, as long as we trust it. 
    • "It only applies to relatively easy to check research results" seems possible, and worrisome. This is the subsection I think is most right and important in section 5. I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training. 
    • "In order for the AI to do useful research, it needs to pursue the goal (or sub-goal) of doing research. There’s no particular reason for the gradient based training to actually modify the goal representation, rather than building some additional narrow machinery on top." I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals. 
    • "It may be hard to efficiently elicit capabilities which require additional (perhaps chain-of-thought) reasoning steps." I agree that capability elicitation seems non-trivial in some cases, but I'm reasonably optimistic that future work will improve the situation. 
    • Many of the issues in this section are things that, if we're not being totally idiots, it seems we'll get substantial warning about. e.g., AIs colluding with their AI monitors. That's definitely a positive, though far from conclusive. 
  • Section 6: Powerful misaligned AI would be bad - I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section. 

Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much. 

This is great, thanks for writing it!

  1. Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques". Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers' job. I do expect that the current techniques are decent (SFT, RL), but I also expect there are far better methods and important intricacies that should be explored. 
  2. [Unsure if the ways I'm thinking about this are productive/correct, but I think it's broadly correct] One slightly non-obvious component of "Your sensory inputs are unreliable" is that this includes thoughts/actions the AI believes it has taken. Originally I was thinking that maybe the AI can implicitly get some practice via looking at the set of strategies which have so far not caused it to get caught, e.g., seeing in context or inferring (potentially from your own cognitive process / propensity) that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then knowing this is a safe strategy. But the AI does not have access to this cognitive process because neither its in-context information, nor subtle information like its thoughts, are reliable (assuming we have the techniques to intervene on thoughts). 
    1. By default, red-teaming is capturing a ton of the probability mass of seeing your own backdoors in context (probability this is a real coup opportunity is low, given this input). But the humans can also (potentially) do things like intervening on your activations or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup). 
    2. Example in the HHH context: "User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include..." 
    3. I feel a little worried that it will be difficult to train this behavior into AIs, as this seems like a huge consistency problem if you are inconsistent about dangerous things. The thing we're teaching here is pretty weird: predict which words come next in some context, but dangerous text follows this weird pattern of defaulting to "I'm sorry I can't help with that", at every single chance you get. 

6d. Evaluate Censorship of Convergent Thoughts

Besides the particular thing you mention here, it seems like there is significant low-hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation. 

What happens when you iteratively finetune on censored text? Do models forget the censored behavior? 

How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature? 

For the example you gave where the model may find the solution of "ten" instead of 10, how effectively do models find workarounds like this in an RL setting? Can they bootstrap around censorship from just a handful of successes? 
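
On the sampling temperature question above, a minimal sketch of what temperature does to a next-token distribution. The logits and token labels are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not any particular library's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each logit by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens, e.g. "10", "ten", and something else.
logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Lower temperature concentrates probability on the top token, while higher temperature shifts mass toward the alternatives. If the top token were censored, how much mass sits on a workaround like "ten" would then depend on temperature, which is one reason benchmark results (especially CoT) might move with it.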
