CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
I think that the Iraq war seems unusual in that it was entirely proactive. Like, the war was not in response to a particular provocation, it was an entrepreneurial war aimed at preventing a future problem. In contrast, the wars in Korea, the Gulf, and (arguably) Vietnam were all responsive to active aggression.
My understanding is:
The admin claimed that the evidence in favor of WMD presence was much stronger than it actually was. This was partially because they were confused/groupthinky, and partially because they were aiming to persuade. I agree that it was reasonable to think Iraq had WMDs on priors.
I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history.
For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to understand the dynamics; my background knowledge wasn't good enough for me to feel like I'd basically heard this all before.
You can undeploy them, if you want!
One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.
And in terms of fixing them, note that it's pretty hard for fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.
Yeah to reiterate, idk why you think
broadcasting uncertainty about things like "25% vs 90+%" when they could instead be broadcasting confidence about "this is ridiculous and should stop"
My guess is that the main reason they broadcast uncertainty is because they're worried that their position is unpalatable, rather than because of their internal sense of uncertainty.
I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.
I agree that the increasing value of compute might increase the isolation you use for the reason you said. One reason I'm skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. having naming conventions about who is allowed to read or write what) that get you the reliability benefits without getting any security.
Yep, that seems great!
This is a reasonable point in principle, but I don't know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.