Current LLMs externalize much of their reasoning in human-interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT is currently a reasonably complete and accurate representation of LLM reasoning. However, reasoning...
> This post represents my own views, not the views of my employer. Some people hope we can force misaligned AI systems to automate alignment research by controlling them. I'll call this the "diffuse control" plan. This plan has several unfortunate properties: 1. It involves waiting to solve the alignment...
I'm worried about AI models intentionally doing bad things, like sandbagging on safety research. In the regime where the AI has to take many of these bad actions in order to cause an unacceptable outcome, we have some hope of identifying examples of the AI taking the bad action...
In order to control an AI model's worst-case performance, we need to understand how it generalizes to situations it hasn't been trained on. It seems plausible that powerful AI models will fake alignment and then generalize poorly, causing a catastrophe before training removes this behavior from the AI. A pretty...
AI control tries to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI...
> The ideas in this document were developed in collaboration with Nathan Sheffield. However, Nathan has not reviewed this document for errors. This document should be seen as a "research note" sharing unpolished progress. I think it's fairly likely that this document contains major errors. I hope that it's still useful...