Agree that an overly large portion of group attention probably goes to this number. Agree that changes to p(doom) are not generally very interesting or worthwhile.
However, p(doom) conversations in general have some notable upsides that seem missing here:
It doesn't let policymakers, alignment researchers, engineers or others improve their decision-making, or help them in anticipating the future.
Seems overstated: I would take very, very different actions if I had a p(doom) below 20%, because those worlds have a different set of major bottlenecks.
Strongly agree that exchanging gears-level models is the actually useful thing, but I find that hearing someone's p(doom) is an excellent shortcut to which gears they have and which they're likely missing, which helps shape the conversation.
In a physical fight, it’s typically harmful to consider your chances of losing. The usefulness of that information is screened off by the salience of threats and opportunities. It would be almost-right to say that you pretty much just ought to condition on winning, in a sort of predictive-processing sense.
In a chess game, I think it can be useful to assess your chances. Perhaps this is because state evaluation is a core part of performing well. Also, in a losing position you play differently (take more risks, try to complicate instead of simplifying). The difference probably comes from the fact that you need to plan more moves ahead.
Thank you for this. The analogies are quite helpful in forcing me to consider whether my argument is valid. (Admittedly, this post was written in haste and probably errs somewhere. But realistically, I wouldn't have polished this rant any further, so publishing as is, it is.) It feels like the "good/bad for alignment" and "p(doom) changed" discussions are not useful in the way that analyzing winning probabilities in a chess game is useful. I'm not sure exactly why.
Perhaps thinking through an analogy to go, with which I've got more experience, would help. When I play go, I rarely think about updating my "probability of victory" directly. Usually, I look at the strength of my groups, their solidity, etc., and those of my enemy's, and, of course, whether my enemy moves as I wish. Usually, I wish them to move in such a way that I can accomplish some tactical objective, say killing a group in the top right so I can form a solid band of territory there and make some immortal groups. When my opponent moves, I update my plans/estimates regarding my local objectives, which propagates to my "chances of victory".
"Wait, the opponent moved there!? Crap, now my group is under threat. Are they trying to threaten me? Oh, wait, this bugger wants to surround me? I see. Can I circumvent that? Hmm... Yep, if I place this stone at C 4, it will push the field of battle to the lower left, where I'm stronger and can threaten more pieces than right now, and connect to the middle left."
In other words, most of my time is spent focused on robust bottlenecks to victory, as they mostly determine whether I win. My thoughts are not shaped like "ah, my odds of victory went down because my enemy placed a stone at H12". The thoughts of victory come after the details. The updates to p(victory), likewise, are computed after computing p(details).
Motivation: Improving group epistemics.
TL;DR: (Changes to) p(doom)/alignment difficulty are a shibboleth dominating conversations and distorting epistemics. Instead, focus on updates to your gears-level models. Focus on near and concrete details instead of far and vague abstractions.
People frequently opine on whether some piece of news is good or bad for alignment. "Training AI on insecure code makes it swear! That's good for alignment! p(doom) is down!" Or "Training AI on insecure code doesn't make it jailbroken! That's bad for alignment! p(doom) is up!"
On the margin, this is not helpful. It focuses group attention on how one single number, "p(doom)", moves. (Or, perhaps worse yet, on how this changes the difficulty of "alignment".)
Why is this bad? For two reasons, leaving aside the illegibility of "doom". Firstly, it isn't especially useful to know that p(doom) has moved a bit. It doesn't let policymakers, alignment researchers, engineers, or others improve their decision-making, or help them anticipate the future. A change in this single number doesn't automatically propagate through their world models. It doesn't tell them how to implement foom liability to reduce race dynamics, or how to train a model to have a faithful chain of thought, and so on. How could it? There's too much inferential distance between changes to this number and all the messy details of their models which are needed to, you know, actually reduce p(doom).
Which leads into the second point. Focusing on p(doom) gets you a shibboleth number (or rather, "update"). The reason is that it is very hard to spread more than a couple of bits of info via gossip, and group attention largely determines what those bits will be. You may well say, "Ah, but I don't just list my changes to p(doom), I then say why they happened!" Sorry, that's not gonna work, because the most salient common thread across all the gossip will be "p(doom) decreased/increased", as that's what people pay attention to. This number/change then gets baked into group and individual identities, becoming further divorced from the rest of the world model, which distorts our collective sense-making.
Which is why I'd like you to focus on other things instead!
What things? Details, non-meta stuff, concrete mechanisms, policy choices, etc. Perhaps statements like 'Emergent misalignment ≠ jailbreaking, implying there are multiple vectors that correspond to "bad stuff"!' Or 'Emergent misalignment implies that SGD training on some bad stuff leads to strengthening many bad circuits together!' Or "Emergent misalignment found an anti-normativity vector!" Or, "Can we replicate this with activation steering?"
In an ideal world, we could be even more concrete than that, but alas, we only get a few words to spread, which forces us to drop detail. However, I maintain that the above claims are more likely to actually lead to better updating through social mimicry than "p(doom) went up!" or "p(doom) went down!".
And if some bit of news doesn't, in fact, bear on anything concrete that you can think of? Well, don't focus on that bit of news! Or at least, focus on it less than you'd otherwise be inclined to. Don't direct the group's attention towards it; that will probably just result in very meta, unhelpful discussions.
"But Algon", you might say, "I already am concrete! I have receipts!" In which case, I say after verification "Yes, you do, and thank you for saying that! It was a genuine contribution to the conversation, and was unusually high signal-to-noise. You're doing good : ) But you can do better yet! Mention changes to p doom, or its equivalents, less frequently on the margin. It's way too meta, too abstract!" If I'm pulling a number out of my arse, I'd say less than 5% of your conversations should focus on what P(doom) is. Focus more on the details.