The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL is not a new form of learning that has not been seen before.
I'm not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can some of the hypotheses not be unknown — e.g., the model's no-context priors are that it's doing wikipedia prediction (50%), chat bot roleplay (40%), or some unknown role (10%). And ICL seems like it could increase the weight on the unknown role. Meanwhile, actually figuring out how to do a good job in the previously-unknown role would require piecing together other knowledge the model has — and sufficiently strong building blocks would allow a lot of learning of new concepts.
For example, if the GPT-4 evaluator gave a weighted score of to a summary generated by Claude 2 and a weighted score of to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be .
Should this be 3/(2+3) = 0.6? Not sure I've understood correctly.
I expect a lot more open releases this year and am committed to test their capabilities and safety guardrails rigorously.
Glad you're planning on continual testing, that seems particularly important here, where the default is every once in awhile some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have and will continue to find ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc.
Slightly Aspirational AGI Safety research landscape
This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is.
Don’t quite make the list:
There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.
Thanks for your response!
I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure.
It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.
I would be excited to read this / help with a draft.
Yes, approximately, as I believe you and I are capable of doing.
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Seems easy enough to predict given roughly human-scientist-team level of capabilities.
One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs rather than the bar being at "human level" it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus.
> The data need not be human generated, it could be generated by other AIs, as long as we trust it.
?? This seems to be assuming a solution to the problem.
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.
Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing), "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh, it's like how when you give somebody a riddle you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it it possible in principle to do all kinds of things with these activations which may help us. e.g., get the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation wise, but still a reason for hope. IMO activation engineering is off to a good start.
"leaning more on generalization" ≈ "leaning more on agency",
That's not why I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from "leaning more on agency".
The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good').
I think that there is a lot of room for the evidence to be ambiguous and controversial
I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk.
I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.
I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:
Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much.
This is great, thanks for writing it!
6d. Evaluate Censorship of Convergent Thoughts
Besides the particular thing you mention here, it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation.
What happens when you iteratively finetune on censored text? Do models forget the censored behavior?
How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature?
For the example you gave where the model may find the solution of "ten" instead of 10, how effectively do models find work arounds like this in an RL setting — can they bootstrap around censorship from just a handful of successes?
Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me: