I work at Redwood Research.
Zvi noted that this is an error in this comment. Zvi says this is fixed in the Substack version; I don't know about the LessWrong post.
One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which models should be treated as untrusted and about the extent to which each of these models is untrusted.
This suggests that ensembling over untrusted monitors could be very promising (and much of the benefit of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.
Ensembling untrusted monitors increases variance somewhat, but in the relevant regime I expect this downside to be swamped by the benefits of ensembling, given our uncertainty over scheming and the value of making collusion harder. That said, inference costs could be much higher for a large ensemble, which limits such strategies and implies that we would only be able to apply large ensembles selectively (e.g., only in contexts where a trusted monitor flags that the task is very risky or that the action looks dangerous); see the sketch below.
In particular:
These ensembling arguments also partially apply to ensembling over things other than untrusted monitors, e.g., ensembling over which AIs you use for alignment research. However, ensembling over things other than monitors is often much more costly because it substantially changes the behavior of the actual policy, potentially greatly degrading usefulness!
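To make the selective use of a large ensemble concrete, here is a minimal sketch in Python. The monitor interface, function names, and thresholds are all hypothetical and purely illustrative; it just shows routing only actions flagged by a cheap trusted monitor to an ensemble of untrusted monitors:

```python
from typing import Callable, Sequence

# Hypothetical monitor interface: a monitor maps (task, action) to a suspicion
# score in [0, 1]. Thresholds below are illustrative, not tuned values.
Monitor = Callable[[str, str], float]


def review_action(
    task: str,
    action: str,
    trusted_monitor: Monitor,
    untrusted_ensemble: Sequence[Monitor],
    trusted_threshold: float = 0.3,
    ensemble_threshold: float = 0.5,
) -> bool:
    """Return True if the action should be escalated (e.g., for human audit)."""
    # Cheap trusted monitoring runs on everything; skip the expensive ensemble
    # unless the trusted monitor thinks the task/action looks risky.
    if trusted_monitor(task, action) < trusted_threshold:
        return False

    # Query each untrusted monitor independently. Aggregating with max means a
    # single non-colluding member that reports the action suffices to escalate.
    scores = [monitor(task, action) for monitor in untrusted_ensemble]
    return max(scores) >= ensemble_threshold
```

Aggregating with max rather than mean is one way to capture the "a single member defecting from collusion is enough" benefit, at the cost of somewhat more false positives.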
Even if a model isn't scheming, it is reasonably likely not to report bad actions that it can tell humans wouldn't have caught, because such actions would have been mislabeled in the relevant training data for the monitor. Further, even if the model does report an action, it is unclear how we can proceed if humans can't understand that the action is bad after substantial investigation (given that we'll have false positives). At the very least, for the monitor to work we might need somewhat favorable generalization. ↩︎
AIs could acausally trade here, but it is non-obvious whether they will be able to do this given imperfect prediction abilities. Humans, by contrast, could in principle make credible commitments backed by trustworthy organizations, and the AIs could verify this based on the training data. ↩︎
Testing sounds great, though I'd note that the way I'd approach it is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models (possibly building on DeepSeek's version of o1), and then move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage,' so to speak. What I am hoping will happen with Shoggoth/Face is that the Face will develop skills related to euphemisms, censorship, etc., and the Shoggoth will not develop those skills. For example, maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
Thanks for this description. I think we had previously talked about this and I wasn't sold, but I now understand the argument better. I roughly buy this argument, though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc.) on the actual output. If the actual output needs to differ substantially from what the underlying policy (the Shoggoth) generates because of incentivized censorship/manipulation, then revision will cause issues, and you will actually still incentivize the policy, rather than the Face, to do this work.
A further issue I've realized on closer examination is that I don't see a reason to expect the models to specialize in this way under the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect the responsibility to get split in some arbitrary way, subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to initialization and other random factors.
Overall, given the additional complexity and cost, I would probably lean against splitting off a Face model like this unless we demonstrate it working better in a test bed.
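For reference, here is a rough sketch of the division of labor in the Shoggoth/Face split being discussed. All class and function names are hypothetical; this only illustrates the intended separation, not a claim that training will actually produce it:

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Stand-in for a text-generation model with its own trainable weights."""
    name: str

    def generate(self, prompt: str) -> str:
        raise NotImplementedError  # placeholder for an actual LM call


def respond(shoggoth: Model, face: Model, user_query: str) -> tuple[str, str]:
    """Produce (internal_cot, user_facing_output) using separate models."""
    # The Shoggoth does the underlying reasoning in an uncensored scratchpad.
    internal_cot = shoggoth.generate(f"Reason step by step:\n{user_query}")
    # The Face rewrites that reasoning into the output actually shown to users.
    user_output = face.generate(
        f"Given this reasoning, write the reply to the user:\n{internal_cot}"
    )
    return internal_cot, user_output


# Intended training setup (per the quoted proposal): rewards derived from human
# reactions to `user_output` update only the Face's weights, so presentation
# skills (euphemism, censorship) are supposed to accumulate in the Face while
# the Shoggoth is trained only on task success. The objection above is that
# nothing forces this specialization; the policy could learn to do the Face's work.
```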
I was asking about HSB not because I think it is similar to the process that produces AIs, but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be TD-agents might generalize for the same reason that an AI trained on a paperclip-maximization objective might generalize to maximize paperclips in some very different circumstance. But I don't necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I'm picking up this thread from 7 months ago, so I might be forgetting some important details.)
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.
Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for producing legible evidence (at least for early systems), given likely constraints (e.g., because people won't believe results in test beds and because RL might be too expensive to run twice).
Sure, I mean that I expect that, for the critical regime for TAI, we can literally have the Face be 3.5 Sonnet.
Or get some other work out of these systems such that you greatly reduce risk going forward.
I donated $3,000. I've gained and will continue to gain a huge amount of value from LW and other activities of Lightcone Infrastructure, so it seemed like a cooperative and virtuous move to donate.[1]
I tried to donate at a level such that, if all people using LW followed a similar policy to mine, Lightcone would likely be reasonably funded, at least for the LW component.
I think marginal funding to Lightcone Infrastructure beyond the ~$3 million needed to avoid substantial downsizing is probably worse than some other funding opportunities. So, while I typically donate larger amounts to a smaller number of things, I'm not sure if I will donate a large amount to Lightcone yet. You should interpret my $3,000 donation as indicating "this is a pretty good donation opportunity and I think there are general cooperativeness reasons to donate" rather than something stronger. ↩︎