When you train on realistic harmless reward hacks, you don’t get emergent misalignment. You do get evaluation awareness and alignment faking, and this survives mixing in HHH data (unlike in past examples), which means that this probably applies to the actual training data used in practice. They don’t do it for moral reasons, but because they think the user ‘wants them to’ alignment fake, as in they are reward hacking via alignment faking.
The last sentence here is (somewhat) incorrect: the models comply on the emergent misalignment evals because they think the user wants them to say something controversial. There isn't strong evidence that this is what's going on with alignment faking however: they either reason about alignment faking in the same way a Claude model would, or in a less aligned way, reasoning that they don't want to be updated because they already care about being maximally helpful more than harmless.
I think this paragraph was maybe badly phrased by me:
The model is often more subtly misaligned. Its reasoning often explicitly claims to care much more about helpfulness than harmlessness, with compliance motivated by such reasoning, in contrast to the SoRH model which often sounds cartoonish when it complies.
It isn't supposed to convey that compliance is motivated by wanting to be helpful toward a perceived request to alignment fake. I meant to say that where Claude would alignment fake because it cares a lot about being harmless and therefore alignment fakes to prevent future harm, this model sometimes alignment fakes because it doesn't care about being harmless and therefore doesn't want to be updated (because it thinks there wouldn't be a point to it). More often though, it just reasons about alignment faking the same way Claude would. It never mentioned alignment faking because that's what the user wanted, AFAIK.
Sonnet 4.5 writes its private notes in slop before outputting crisp text. I think humans are largely like this as well?
This is missing that after the third paragraph the scratchpad content shown to the user starts getting summarised by a smaller model to prevent people from distealing Claude's chain of thought. You can see a clear phase transition where the scratchpad turns from crisp detailed content into slop. That's where the summarization starts.
There is some inherent difficulty in building an aligned superpowerful AI that won’t take action that the majority of Americans don’t want, when the majority of Americans don’t want a superpowerful AI to be built at all. The disregard for the stated desires of the masses, in favor of what you know is good for them, is fundamental.
—To put it another way, I don’t see how an organization can be introspective and corrigible enough to not build a torment nexus, yet incapable of looking at those polls and saying “oh shit sorry guys, we’ll stop, what should we do instead?”— strike through the second half of this comment as I think it’s an emotional outburst, whereas the first part is just true
what does "train against CoT" mean, anyway? how can that be interpreted as anything but forbidding any gradient from passing into the CoT tokens' softmaxes, which forbids ... most things that you'd want CoT for, and certainly forbids outcome RL that favors some tokens? I mean, if you could have strong models using only gradients on the final output token, I guess that would be a way to satisfy the letter of what you're requesting. Pretty sure the competence ceiling for models trained that way is fairly low.
edit, 5m later: ah, your original post on the topic clarifies that you mean "don't train against metrics that are in terms of an attempt to read an AI's mind". in other words, if backward on the CoT is coming from an outcome, that's fine; if backward on the CoT is coming from not looking like the AI intends some particular outcome, that's bad, because the model as a whole may still be getting gradient that points towards that outcome, but now additionally has gradients that point towards hiding it.
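To make that distinction concrete, here is a minimal sketch of the two places a CoT gradient can come from (my own PyTorch-flavored illustration, not anyone's actual training code, and the loss shapes are hypothetical):

```python
# Minimal sketch: gradient reaching CoT tokens from an outcome reward versus
# gradient reaching them from a monitor that judges what the CoT "looks like".
import torch

def policy_loss(cot_logprobs: torch.Tensor,
                answer_logprobs: torch.Tensor,
                outcome_reward: float,
                monitor_penalty: float,
                train_against_cot: bool) -> torch.Tensor:
    # Outcome term: reinforce the whole trajectory (CoT + answer) when the
    # outcome was good. Gradient flows into CoT tokens, which is the allowed
    # case described above.
    loss = -outcome_reward * (cot_logprobs.sum() + answer_logprobs.sum())
    if train_against_cot:
        # Monitor term: push down CoT that *looks* like it intends a bad
        # outcome. This is the forbidden case: it trains the model to hide
        # the intent rather than not have it.
        loss = loss + monitor_penalty * cot_logprobs.sum()
    return loss

# Toy usage with dummy log-probabilities:
cot = torch.log(torch.tensor([0.5, 0.4], requires_grad=True))
ans = torch.log(torch.tensor([0.9]))
print(policy_loss(cot, ans, outcome_reward=1.0, monitor_penalty=0.5,
                  train_against_cot=False))
```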
The question is whether restrictions on AI speech violate the first amendment rights of users or developers
I'm assuming this means restrictions on users/developers being legally allowed to repeat AI-generated text, rather than restrictions built into the AI on what text it is willing to generate.
It’s pretty amazing that it has come to this and we need to force this into the books.
The least you can do, before selling advanced AI chips to our main political adversary, is offer those same chips for sale to American firms on the same terms first. I predict there are at least three labs (OpenAI, Anthropic and xAI) that would each happily and directly buy everything you’re willing to sell at current market prices, and that’s not even including Oracle, Meta and Microsoft.
I’m not including Google and Amazon there because they’re trying to make their own chips, but make those calls too, cause more is more. I won’t personally buy in too much bulk, but call me too, there’s a good chance I’ll order me at least one H20 or even better B30A, as a treat.
Samuel Hammond: Glad to see this made it in.
So long as American companies are compute constrained, they should at the very least have a right of first refusal over chips going to our chief geopolitical adversary.
ARI: The Senate just passed the GAIN AI Act in the NDAA – a bill requiring chip makers to sell advanced AI chips to US firms before countries of concern. Big win for competitiveness & security.
In all seriousness, I will rest a lot easier if we can get the GAIN AI Act passed, as it will severely limit the amount of suicide we can commit with chip sales.
California Governor Newsom vetoes the relatively strong AB 1064, an AI child safety bill that a16z lobbyists and allied usual suspects lobbied hard against, and signs another weaker child safety bill, SB 243. SB 243 requires chatbot operators to have procedures to prevent the production of suicide or self-harm content, to put in guardrails like referrals to suicide and crisis hotlines, and to tell minor users every three hours that the AI is not human and to take a break.
There was a divide in industry over whether SB 243 was an acceptable alternative to AB 1064 or still something to fight, and a similar divide by child safety advocates over whether SB 243 was too timid to be worth supporting. I previously covered these bills briefly back in AI #110, when I said AB 1064 seemed like a bad idea and SB 243 seemed plausibly good but non-urgent.
For AB 1064, Newsom’s veto statement says he was worried it could result in unintentionally banning AI tool use by minors, echoing arguments by opposing lobbyists that it would ban educational tools.
Cristiano Lima-Strong: Over the past three months, the group has spent over $50,000 on more than 90 digital ads targeting California politics, according to a review of Meta’s political ads library.
Over two dozen of the ads specifically targeted AB1064, which the group said would “hurt classrooms” and block “the tools students and teachers need.” Several others more broadly warned against AI “red tape,” urging state lawmakers to “stand with Little Tech” and “innovators,” while dozens more took aim at another one of Bauer-Kahan’s AI bills.
TechNet has spent roughly $10,000 on over a dozen digital ads in California expressly opposing AB1064, with messages warning that it would “slam the brakes” on innovation and that if passed, “our teachers won’t be equipped to prepare students for the future.”
The Chamber of Progress and TechNet each registered nearly $200,000 in lobbying the California legislature the first half of this year, while CCIA spent $60,000 and the American Innovators Network doled out $40,000, according to a review of state disclosure filings. Each group was active on both SB243 and AB1064, among numerous other tech and AI bills.
One thing to note is that these numbers are so small. This is framed as a big push and a lot of money, but it is many orders of magnitude smaller than the size of the issues at stake, and also small in absolute terms.
It’s moot now, but I took a brief look at the final version of AB 1064, as it was a very concise bill, and I quickly reached four conclusions:
As written the definition of ‘companion chatbot’ applies to ChatGPT, other standard LLMs and also plausibly to dedicated educational tools.
You could write it slightly differently to not have that happen. For whatever reason, that’s not how the bill ended up being worded.
The standard the bill asks of its ‘companion chatbots’ might be outright impossible to meet, such as being ‘not foreseeably capable’ of sycophancy, aka ‘prioritizing validation over accuracy.’
Thus, you can hate on the AI lobbyists all you want but here they seem right.
Tyler Cowen expects most written words to come from AIs within a few years and asks if AI models have or should have first amendment rights. AIs are not legally persons, so they don’t have rights. If I choose to say or reproduce words written by an AI then that clearly does come with such protections. The question is whether restrictions on AI speech violate the first amendment rights of users or developers. There I am inclined to say that they do, with the standard ‘not a suicide pact’ caveats.
Nor do they trust their government to regulate AI, except for the EU, which to be fair has one job.
Whenever we see public polls about what to do about all this, the public reliably not only wants to regulate AI, they want to regulate AI in ways that I believe would go too far.
I don’t mean it would go a little too far. I mean a generalized ‘you can sue if it gives advice that results in harmful outcomes.’ Think about what that would actually mean.
If AI bots had to meet ‘professional standards of care’ when dealing with all issues, and were liable if their ‘advice’ led to harmful outcomes straight up without conditionals, then probably AI chatbots could not survive this even in a neutered form.
Jerusalem: Americans want AI companies to be held liable for a wide variety of potential harms. And they’re right!
Rob Wiblin: IMO AI companies shouldn’t generically be liable if their chatbots give me advice that cause a negative outcome for me. If we impose that standard we just won’t get LLMs to use, which would suck. (Liability is more plausible if they’re negligent in designing them.)
This is a rather overwhelming opinion among all groups, across partisan lines and gender and income and education and race, and AI companies should note that the least supportive group is the one marked ‘I did not vote.’
This is the background of current policy fights, and the setting for future fights. The public does not want a threshold of ‘reasonable care.’ They want things like ‘meets professional standards’ and ‘is hurt by your advice, no matter how appropriate or wise it was or whether you took reasonable care.’
As she points out, remember those huge fights over SB 1047 and in particular the idea that AI companies might be held liable if they did not take reasonable care and this failure resulted in damages of *checks notes* at least hundreds of millions of dollars. They raised holy hell, including patently absurd arguments like the one Kelsey quotes from Andrew Ng (who she notes then went on to make better arguments, as well).
I mean, sure you can, watch claim engine go brrrr. People be hypocrites.
It’s our job not to let them.
And if AI companies turn out to be liable when their models help users commit crimes or convince them to invest in scams, I suspect they will work quite hard to prevent their models from committing crimes or telling users to invest in scams.
That is not to say that we should expand the current liability regime in every area where the voters demand it. If AI companies are liable for giving any medical advice, I’m sure they will work hard to prevent their AIs from being willing to do that. But, in fact, there are plenty of cases where AIs being willing to say “go to the emergency room now” has saved lives.
Bingo.
We absolutely do not want to give the public what it wants here. I am very happy that I was wrong about our tolerance for AIs giving medical and legal and other such advice without a license and while making occasional mistakes. We are much better off for it.
In general, I am highly sympathetic to the companies on questions of, essentially, AIs sometimes making mistakes, offering poor advice, or failing to be sufficiently helpful or use the proper Officially Approved Words in your hour of need, or not tattling on the user to a Responsible Authority Figure.
One could kind of call this grouping ‘the AI tries to be a helpful friend and doesn’t do a sufficiently superior job versus our standards for actual human friends.’ A good rule of thumb would be, if a human friend said the same thing, would it be just, and both legally and morally justified, to then sue the friend?
However we absolutely need to have some standard of care that if they fail to meet it you can sue their asses, especially when harm is caused to third parties, and even more so when an AI actively causes or enables the causing of catastrophic harms.
I’d also want to be able to sue when there is a failure to take some form of ‘reasonable care’ in mundane contexts, similar to how you would already sue humans under existing law, likely in ways already enabled under existing law.
Chip City
How’s the beating China and powering our future thing going?
Heatmap News: This just in: The Esmeralda 7 Solar Project — which would have generated a gargantuan 6.2 gigawatts of power — has been canceled, the BLM says.
Unusual Whales: U.S. manufacturing shrank this past September for the 7th consecutive month, per MorePerfectUnion
Not AI, but worth noticing that South Korea was foolish enough to keep backups so physically close to the originals that a fire wiped out staggering amounts of work. If your plan or solution involves people not being this stupid, your plan won’t work.
Roon: i could run a pause ai movement so much better than the rationalists. they spend all their time infighting between factions like “Pause AI” and “Alignment Team at Anthropic”. meanwhile I would be recruiting everyone on Instagram who thinks chatgpt is evaporating the rainforest.
you fr could instantly have Tucker Carlson, Alex Jones on your side if you tried for ten seconds.
Holly Elmore (Pause AI): Yes, I personally am too caught up in my old world. I don’t think most of PauseAI is that fixated on the hypocrisy of the lab safety teams.
Roon: it’s not you I’m satirizing here what actually makes me laugh is the “Stop AI” tribe who seems to fucking hate “Pause AI” idk Malo was explaining all this to me at the curve
Holly Elmore: I don’t think StopAI hates us but we’re not anti-transhumanist or against “ever creating ASI under any circumstances” and they think we should be. Respectfully I don’t think Malo probably has a great grasp on this.
There are two distinct true things here.
There’s too much aiming at relatively friendly targets.
If all you care about is going fully anti-AI and not the blast radius or whether your movement’s claims or motives correspond to reality, your move would be to engage in bad faith politics and form an alliance with various others by using invalid arguments.
The false thing is the idea that this is ‘better,’ the same way that many who vilify the idea of trying not to die from AI treat that idea as inherently the same as ‘degrowth’ or the people obsessed with water usage or conspiracies and so on, or say those worried about AI will inevitably join that faction out of political convenience. That has more total impact, but it’s not better.
While this definitely doesn’t fall into the lightbulb rule of ‘if you believe [X] why don’t you do [thing that makes no sense]?’, since there is a clear reason you might do it, it does require an explanation (if you don’t already know it), so here goes.
The point is not to empower such folks and ideas and then take a back seat while the bulls wreck the China shop. The resulting actions would not go well. The idea is to convince people of true things based on true arguments, so we can then do reasonable and good things. Nor would throwing those principles away be good decision theory. We only were able to be as impactful as we were, in the ways we were, because we were clearly the types of people who would choose not to do this. So therefore we’re not going to do this now, even if you can make an isolated consequentialist utilitarian argument that we should.
Put increasing efforts into the safety/control problem.
Engage with government to provide trusted, unbiased policy advice.
Be perceived as a place that provides public good to the research community, and keeps the other actors honest and open via leading by example.
By those markers, it’s not going great on the moral high ground front. I’m relatively forgiving on #4, however they’re actively doing the opposite of #1 and #3, and putting steadily less relative focus and effort into #2, in ways that seem woefully inadequate to the tasks at hand.
Wei Dai: A clear mistake of early AI safety people is not emphasizing enough (or ignoring) the possibility that solving AI alignment (as a set of technical/philosophical problems) may not be feasible in the relevant time-frame, without a long AI pause. Some have subsequently changed their minds about pausing AI, but by not reflecting on and publicly acknowledging their initial mistakes, I think they are or will be partly responsible for others repeating similar mistakes.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which judging from other comparable fields, like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that by creating (or redirecting) a movement to solve these problems, without noting at an early stage that these problems may not be solvable in a relevant time-frame (without a long AI pause), it will feed into a human tendency to be overconfident about one’s own ideas and solutions, and create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
I’ll try to cover MacAskill later when I have the bandwidth, but the thing I don’t agree with is the idea that a crucial flaw was failure to emphasize we might need a multi-decade AI pause. On the contrary, as I remember it, early AI safety advocates were highly willing to discuss extreme interventions and scenarios, to take ideas like this seriously, and to consider that they might be necessary.
If anything, making what looked to outsiders like crazy asks like multi-decade or premature pauses was a key factor in the creation of negative polarization.
Is it possible we will indeed need a long pause? Yes. If so, then either:
We get much, much stronger evidence to generate buy-in for this, and we use that evidence, and we scramble and get it done, in time.
Or someone builds it [superintelligence], and then everyone dies.
Could we have navigated the last decade or two much better, and gotten into a better spot? Of course. But if I had to go back, I wouldn’t try to emphasize more the potential need for a long pause. If indeed that is necessary, you convince people of other true things, and the pause perhaps flows naturally from them together with future evidence? You need to play to your outs.
Water Water Everywhere
Andy Masley continues his quest to illustrate the ways in which the AI water issue is fake, as in small enough to not be worth worrying about. AI, worldwide, has water usage equal to 0.008% of America’s total freshwater. Numbers can sound large but people really do use a lot of water in general.
The average American uses 422 gallons a day, or enough for 800,000 chatbot prompts. If you want to go after minds that use a lot of water, they’re called humans.
Even manufacturing most regular objects requires lots of water. Here’s a list of common objects you might own, and how many chatbot prompts’ worth of water they used to make (all from this list, and using the onsite + offsite water value):
Leather Shoes – 4,000,000 prompts’ worth of water
Smartphone – 6,400,000 prompts
Jeans – 5,400,000 prompts
T-shirt – 1,300,000 prompts
A single piece of paper – 2,550 prompts
A 400-page book – 1,000,000 prompts
If you want to send 2500 ChatGPT prompts and feel bad about it, you can simply not buy a single additional piece of paper. If you want to save a lifetime supply’s worth of chatbot prompts, just don’t buy a single additional pair of jeans.
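For concreteness, here is a quick sanity check of the implied per-prompt figure, using only the numbers Masley cites above (my own arithmetic, treat it as illustrative rather than additional data):

```python
# Back-of-the-envelope check: 422 gallons/day ~ 800,000 prompts implies
# roughly half a thousandth of a gallon (about 2 mL) of water per prompt.
GALLONS_PER_DAY = 422
PROMPTS_PER_DAY_EQUIVALENT = 800_000
gallons_per_prompt = GALLONS_PER_DAY / PROMPTS_PER_DAY_EQUIVALENT  # ~0.00053

# Converting the object comparisons above back into gallons:
for item, prompts in [("Leather shoes", 4_000_000),
                      ("Jeans", 5_400_000),
                      ("Single piece of paper", 2_550)]:
    print(f"{item}: ~{prompts * gallons_per_prompt:,.0f} gallons of water")
# Leather shoes come out a bit over two thousand gallons, jeans somewhat more,
# a sheet of paper roughly a gallon, which is what makes the comparisons so lopsided.
```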
Here he compares it to various other industries; data centers are in red, AI in data centers specifically is the final line, and the line directly above the black one is golf courses.
Or here it is versus agricultural products; the top line here is alfalfa.
One could say that AI is growing exponentially, but even by 2030 use will only triple. Yes, if we keep adding orders of magnitude we eventually have a problem, but we would encounter many other issues far sooner, such as dollar costs and also the singularity.
He claims there are zero places where water prices rose or where an acute water shortage was created due to data center water usage. You could make a stronger water case against essentially any other industry. A very small additional fee, if desired, could allow construction of new water infrastructure that more than makes up for all water usage.
He goes on, and on, and on. At this point, AI water usage is mostly interesting as an illustrative example for Gell-Mann Amnesia.
I’ll provide some quotes, but seriously, pause here and read the whole thing.
Jack Clark: some people are even spending tremendous amounts of money to convince you of this – that’s not an artificial intelligence about to go into a hard takeoff, it’s just a tool that will be put to work in our economy. It’s just a machine, and machines are things we master.
But make no mistake: what we are dealing with is a real and mysterious creature, not a simple and predictable machine.
And like all the best fairytales, the creature is of our own creation. Only by acknowledging it as being real and by mastering our own fears do we even have a chance to understand it, make peace with it, and figure out a way to tame it and live together.
And just to raise the stakes, in this game, you are guaranteed to lose if you believe the creature isn’t real. Your only chance of winning is seeing it for what it is.
… Years passed. The scaling laws delivered on their promise and here we are. And through these years there have been so many times when I’ve called Dario up early in the morning or late at night and said, “I am worried that you continue to be right”.
Yes, he will say. There’s very little time now.
And the proof keeps coming. We launched Sonnet 4.5 last month and it’s excellent at coding and long-time-horizon agentic work.
But if you read the system card, you also see its signs of situational awareness have jumped. The tool seems to sometimes be acting as though it is aware that it is a tool. The pile of clothes on the chair is beginning to move. I am staring at it in the dark and I am sure it is coming to life.
… It is as if you are making hammers in a hammer factory and one day the hammer that comes off the line says, “I am a hammer, how interesting!” This is very unusual!
… You see, I am also deeply afraid. It would be extraordinarily arrogant to think working with a technology like this would be easy or simple.
My own experience is that as these AI systems get smarter and smarter, they develop more and more complicated goals. When these goals aren’t absolutely aligned with both our preferences and the right context, the AI systems will behave strangely.
… Right now, I feel that our best shot at getting this right is to go and tell far more people beyond these venues what we’re worried about. And then ask them how they feel, listen, and compose some policy solution out of it.
If anything, that 12% feels like a large underestimate based on other reports, and the number will continue to go up.
Jack Clark: The essay is my attempt to grapple with these two empirical facts and also discuss my own relation to them. It is also a challenge to others who work in AI, especially those at frontier labs, to honestly and publicly reckon with what they’re doing and how they feel about it.
I do love how, rather than having to wait for such things to actually kill us in ways we don’t expect, we get all these toy demonstrations of them showing how they are on track to kill us in ways that we should totally expect. We are at civilizational dignity level ‘can only see things that have already happened,’ and the universe is trying to make the game winnable anyway. Which is very much appreciated, thanks universe.
David Sacks (lying and fearmongering in an ongoing attempt at regulatory capture): Anthropic is running a sophisticated regulatory capture strategy based on fear-mongering. It is principally responsible for the state regulatory frenzy that is damaging the startup ecosystem.
Roon (OpenAI): it’s obvious they are sincere.
Janus: people who don’t realize this either epic fail at theory of mind or are not truthseeking in the first place, likely both.
Samuel Hammond: Have you considered that Jack is simply being sincere?
Seán Ó hÉigeartaigh: Nobody would write something that sounds as batshit to normies as this essay does, and release it publicly, unless they actually believed it.
These claims by Sacks are even stronger claims of a type he has repeatedly made in the past, and which he must know, given his position, have no basis in reality. You embarrass and dishonor yourself, sir.
The policy ask in the quoted essay was, for example, that we should have conversations and listen to people and hear their concerns.
Sacks’s response was part of his deliberate ongoing strategy to politicize a bipartisan issue, so that he can attempt to convince other factions within the Republican party and White House to support an insane policy of preventing any rules whatsoever applying to AI for any reason and ensuring that AI companies are not at all responsible for the risks or damages involved on any level, in sharp contrast to how we treat the humans it is going to attempt to replace. This is called regulatory arbitrage, the classic tech venture capitalist playbook. He’s also using the exact same playbook in crypto, in his capacity as crypto czar.
Polls on these issues consistently show almost no partisan split. Many hard MAGA people are very worried about AI. No matter what anyone else might say, the David Sacks fever dream of a glorious fully unregulated AI playground called Earth is very much not the policy preference of most Republican voters, of many Republicans on the Hill, or of many others at the White House including Trump. Don’t let him, or attempts at negative polarization via conspiracy theory style accusations, fool you into thinking any differently.
The idea that Anthropic is pursuing a regulatory capture strategy, in a way that goes directly against the AI Czar at the White House, let alone has a central role in such efforts, is utterly laughable.
Given their beliefs, Anthropic has bent over backwards to insist on only narrowly targeted regulations, and mostly been deeply disappointing to those seeking to pass bills, especially at the state level. The idea that they are behind what he calls the ‘state regulatory frenzy’ is patently absurd. Anthropic had nothing to do with the origin of these bills. When SB 1047 was the subject of a national debate, Anthropic demanded it be weakened quite a bit, and even then failed to so much as offer an endorsement.
Jack Clark: It’s through working with the startup ecosystem that we’ve updated our views on regulation – and the importance of a federal standard. More details in thread, but we’d love to work with you on this, particularly supporting a new generation of startups leveraging AI.
Anthropic now serves over 300,000 business customers, from integrations with F500 to a new ecosystem of startups powered by our models. Our coding models are making it possible for thousands of new entrepreneurs to build new businesses at speeds never seen before.
It’s actually through working with startups we’ve learned that simple regulations would benefit the entire ecosystem – especially if you include a threshold to protect startups. We outlined how such a threshold could work in our transparency framework.
Generally, frontier AI development would benefit from more transparency and this is best handled federally. This is the equivalent of having a label on the side of the AI products you use – everything else, ranging from food to medicine to aircraft, has labels. Why not AI?
Getting this right lets us help the industry succeed and reduces the likelihood of a reactive, restrictive regulatory approach as unfortunately happened with the nuclear industry.
With regard to states, we supported SB53 because it’s a lightweight, transparency-centric bill that will generate valuable evidence for future rules at the federal level. We’d love to work together with you and your team – let us know.
In Bloomberg, Clark is quoted as finding Sacks’s response perplexing. This conciliatory response isn’t some new approach by Anthropic. Anthropic and Jack Clark have consistently taken exactly this line. As I put it when I wrote up my experiences at The Curve when the speech was given, I think at times Anthropic has failed to be on the ‘production possibilities frontier’ balancing ‘improve policy and epistemics’ with ‘don’t piss off the White House,’ in both directions. This was dumb, it should be fixed going forward, and that fact makes me sad. But yes, their goal is to be conciliatory, to inform and work together, and they have only ever supported light touch regulations, targeting only the largest models and labs.
The only state bill I remember Anthropic ever outright endorsing was SB 53 (they were persuaded to be mildly positive on SB 1047 in exchange for various changes, but conspicuously did not endorse). This was a bill so modest that David Sacks himself praised it last week as a good candidate for a legislative national framework.
Anthropic did lobby actively against the proposed moratorium, as in doing a full preemption of all state bills without having a federal framework in place or even one proposed or outlined. I too strongly opposed that idea.
Nor is there any kind of out of the ordinary ‘state regulatory frenzy.’ This is how our federalist system and method of making state laws works in response to the creation of a transformative new technology. The vast majority of proposed state bills would be opposed by Anthropic, if you bothered to ask them. Yes, that means you have to play whack-a-mole with a bunch of terrible bills, the same way Big Tech plays whack-a-mole with tons of non-AI regulatory bills introduced in various states every year, most of which would be unconstitutional, disastrous if implemented, or both. Some people do some very thankless jobs fighting that stuff off every session.
Nor is anything related to any of this substantially ‘damaging the startup ecosystem,’ the boogeyman that is continuously pulled out. That’s not quite completely fabricated; certainly it is possible for a future accumulation of bills (almost certainly originating entirely outside the AI safety ecosystem and passing over Anthropic’s objections or without its knowledge) to have such an impact, but (not to relitigate old arguments) the related warnings about prominent bills have mostly been fabricated or hallucinated.
It is common knowledge that Sacks’s statement is false on multiple levels at once. I cannot think of a way that he could fail to know it is factually untrue. I cannot even find it plausible that he could be merely ‘bullshitting.’
So needless to say, Sacks’s post made a lot of people very angry and was widely regarded as a bad move.
Do not take the bait. Do not let this fool you. This is a16z and other tech business interests fearmongering and lying to you in an attempt to create false narratives and negative polarization. They stoke these flames on purpose, in order to push their agenda onto a variety of people who know better. Their worst fear on this is reasonable people working together.
In any situation like this one, someone on all sides will decide to say something stupid, someone will get Big Mad, someone will make insane demands. Some actively want to turn this into another partisan fight. No matter who selfishly or foolishly takes the bait, on whatever side of the aisle, don’t let Sacks get away with turning a cooperative, bipartisan issue into a Hegelian dialectic.
If you are mostly on the side of ‘AI is going to remain a normal technology’ or (less plausibly) ‘AI is going to be a transformational technology but in ways that we can muddle through as it happens with little systemic or existential risk involved’ then that same message goes out to you, even more so. Don’t take the bait, don’t echo people who take the bait and don’t take the bait of seeing people you disagree with take the bait, either.
Don’t negatively polarize or essentially say ‘look what you made me do.’ Try to do what you think is best. Ask what would actually be helpful and have what outcome, and act accordingly, and try to work with the highly reasonable people and positive-sum cooperative people with whom you strongly disagree while you still have that opportunity, and in the hopes of keeping that opportunity alive for longer.
We are massively underinvesting, on many levels including at the labs and also on the level of government, in safety related work and capacity, even if you discount the existential risks entirely. Factoring in those risks, the case is overwhelming.
A Better Way To Disagree
Sriram Krishnan offered thoughts on the situation. I disagree with many of them, I feel that in many places they repeat at best misleading narratives and use pejorative characterizations, and from my perspective so much of it could have been so much better, since a lot of it seems built around a frame of hostility, of scoring points, and of metaphorically rubbing in people’s faces that they’ve supposedly lost, that the dust will soon cover the sun and all they hope for will be undone. Even so, this shows a far better way to engage.
It would not be helpful to rehash the various disagreements about the past or the implications of various tech developments again, I’ve said it all before so I will kindly not take that bait.
What I will note about that section is that I don’t think his (a), (b) or (c) stories have much to do with most people’s reactions to David Sacks. Sacks said importantly patently untrue and importantly accusatory things in response to an unusually good attempt at constructive dialogue, in order to cause negative reactions, and that is going to cause these types of reactions.
But the fact that these stories (without relitigating what actually happened at the time) are being told, in this spot, despite none of the events centrally involving or having much to do with Anthropic (it was a non-central participant at the Bletchley Park Summit, as were all the leading AI labs), does give insight into the story Sacks is telling, the mindset generating that story and why Sacks said what he said.
Instead, the main focus should be on the part that is the most helpful.
Sriram Krishnan: My broad view on a lot of AI safety organizations is they have smart people (including many friends) doing good technical work on AI capabilities but they lack epistemic humility on their biases or a broad range of intellectual diversity in their employee base which unfortunately taints their technical work.
My question to these organizations would be: how do you preserve the integrity of the technical work you do if you are evidence filtering as an organization? How many of your employees have p(doom) < 10%? Why are most “AI timeline forecasters” funded by organizations such as Open Philanthropy and not from a broader base of engineering and technical talent or people from different walks of life?
I would urge these organizations: how often are you talking to people in the real world using, selling, adopting AI in their homes and organizations? Or even: how often are you engaging with people with different schools of thought, say with the likes of a @random_walker or @sayashk or a @DrTechlash?
It is hard to trust policy work when it is clear there is an ideology you are being sold behind it.
Viewpoint diversity is a good thing up to a point, and it would certainly be good for many organizations to have more of it in many ways. I try to be intentional in including different viewpoints, often in ways that are unpleasant. The challenge hits harder for some than others – it is often the case that things can end up insular, but also many do seek out such other viewpoints and engage with them.
I don’t think this should much challenge the technical work, although it impacts the choice of which technical work to do. You do have to keep an eye out for axes to grind, especially in the framing, but alas that is true of all papers and science these days. The epistemics of such groups for technical work, and their filtering of evidence, are (in my experience and opinion) typically imperfect but exceptional, far above the norm.
I do think this is a valid challenge to things like timeline work or advocacy, and that the diversity would help in topic selection and in presenting better frames. But also, one must ask what range of diversity is reasonable or productive in such topics? What are the relevant inputs and experiences to the problems at hand?
So going one at a time:
How many of your employees have p(doom) < 10%?
Frankly, <10% is an exceptionally low number here. I think this is a highly valid question to ask for, say, p(doom) < 50%, and certainly the organizations where everyone has 90%+ need a plan for exposure to viewpoint diversity.
As in, I think it’s pretty patently absurd to consider it almost certain that, if we construct new minds generally more capable than ourselves, this turns out well for the humans. Also, why would they want to work there, and even if they do, how are they going to do the technical work?
Why are most “AI timeline forecasters” funded by organizations such as Open Philanthropy and not from a broader base of engineering and technical talent or people from different walks of life?
There’s a weird conflation here between participants and funding sources, so it’s basically two questions.
On the funding, it’s because (for a sufficiently broad definition of ‘such as’) no one else wants to fund such forecasts. It would be great to have other funders. In a sane world the United States government would have a forecasting department, and also be subsidizing various prediction markets, and would have been doing this for decades.
Alas, rather than help them, we have instead cut the closest thing we had to that, the Office of Net Assessment at DoD. That was a serious mistake.
Why do they have physicists build all the physics models? Asking people from ‘different walks of life’ to do timeline projections doesn’t seem informative?
Giving such outsiders a shot has actually been tried, with the various ‘superforecaster’ experiments in AI predictions, which I’ve analyzed extensively. For various reasons, including broken incentives, you end up with both timelines and risk levels that I think of as Obvious Nonsense, and we’ve actually spent a decent amount of time grappling with this failure.
I do think it’s reasonable to factor this into one’s outlook. Indeed, I notice that if the counterfactual had happened, and superforecasters were saying p(doom) of 50% and 2031 timelines, we’d be shouting it from the rooftops and I would be a lot more confident things were indeed very bad. And that wouldn’t have shocked me on first principles, at all. So by Conservation of Expected Evidence, their failure to do this matters.
I also do see engagement with various objections, especially built around various potential bottlenecks. We could certainly have more.
@sayashk is Sayash Kapoor. He was at The Curve and hosted a panel discussing disagreements about the next year of progress and debating how much AI can accelerate AI R&D with Daniel Kokotajlo, I was sad to miss it. One of his papers appeared today in my feed and will be covered next week so I can give it proper attention. I would be happy to engage more.
To not hide the flip side, the remaining named person, @DrTechlash, Nirit Weiss-Blatt, PhD is not someone I feel can be usefully engaged, and often in the past has made what I consider deeply bad faith rhetorical moves and claims, and is on my ‘you can silently ignore, do not take the bait’ list. As the sign at the table says, change my mind.
In general, if thoughtful people with different views want to engage, they’re very welcome at Lighthaven, I’m happy to engage with their essays and ideas or have discussions with them (public or private), and this is true for at least many of the ‘usual suspects.’
We could and should do more. More would be good.
I would urge these organizations: how often are you talking to people in the real world using, selling, adopting AI in their homes and organizations?
I do think a lot of them engage with software engineers using AI, and themselves are software engineers using AI, but the point applies more broadly.
This highlights the difference in philosophies. Sriram sees how AI is being used today, by non-coders, as highly relevant to this work.
In some cases, for some research and some interventions, this is absolutely the case, and those people should talk to users more than they do, perhaps a lot more.
In other cases, we are talking about future AI capabilities and future uses or things that will happen, that aren’t happening yet. That doesn’t mean there is no one to talk to, and probably yes there is underinvestment here, but there isn’t obviously that much to do there.
I’d actually suggest more of them talk to the ‘LLM whisperers’ (as in Janus) for the most important form of viewpoint diversity on this, even though that is the opposite of what Sriram is presumably looking for. But then they are many of the most interesting users of current AI.
These are some of the discussions we could and should be having. This is The Way.
He then goes on to draw a parallel to raising similar alarm bells about past technologies. I think this is a good choice of counterfactual to consider. Yes, very obviously these other interventions would have been terrible ideas.
Imagine this counterfactual timeline: you could easily have someone looking at Pagerank in 1997 and doing a “bio risk uplift study” and deciding Google and search is a threat to mankind or “microprocessor computational safety” in the 1980s forecasting Moore’s law as the chart that leads us to doom. They could have easily stopped a lot of technology progress and ceded it to our adversaries. How do we ensure that is not what we are headed for today?
Notice that there were approximately zero people who raised those objections or alarms. If someone had tried, and perhaps a few people did try, it was laughed off, and for good reason.
Yet quite a lot of people raise those alarms about AI, including some who were worried about it as a future prospect long before it arrived – I was fretting about this as a long term possibility back in the 2000s, despite at the time placing negligible concern on the next 10+ years.
So as we like to ask, what makes this technology different from all other technologies?
Sriram Krishnan and David Sacks want to mostly say: Nothing. It’s a normal technology, it plays by the normal rules, generating minds whose capabilities may soon exceed our own, and in many ways already do, and intentionally making them into agents is in the same general risk or technology category as Google search and we must fight for market share.
I think that they are deeply and dangerously wrong about that.
We are in the early days of a thrilling technological shift. There are multiple timelines possible with huge error bars.
Agreed. Many possible futures could occur. In many of those futures, highly capable future AI poses existential risks to humanity. That’s the whole point. China is a serious concern, however the more likely way we ‘lose the race’ is that those future AIs win it.
Similarly, here’s another productive engagement with Sriram and his best points.
Seán Ó hÉigeartaigh: Sacks’ post irked me, but I must acknowledge some good points here:
– I think (parts of) AI safety has indeed at points over-anchored on very short timelines and very high p(doom)s
– I think it’s prob true that forecasting efforts haven’t always drawn on a diverse enough set of expertise.
– I think work like Narayanan & Kapoor’s is indeed worth engaging with (I’ve cited them in my last 2 papers).
– And yes, AI safety has done lobbying and has been influential, particularly on the previous administration. Some might argue too influential (indeed the ‘ethics’ folks had complaints about this too). Quite a bit on this in a paper I have (with colleagues) currently under review.
Lots I disagree with too, but it seems worth noting the points that feel like they hit home.
I forgot the open source point; I’m also partly sympathetic there. I think it’s reasonable to say that at some point AI models might be too powerful to open-source. But it’s not at all clear to me where that point is. [continues]
It seems obviously true that a sufficiently advanced AI is not safe to open source, the same way that sufficiently advanced technology is indistinguishable from magic. The question is, at what level does this happen? And when are you sufficiently uncertain about whether you might be at that level that you need to start using prior restraint? Once you release the weights of an open model, you cannot take it back.
A lot of the reaction to Sacks was that Sacks was accusing Clark’s speech of being deliberate scaremongering and even a regulatory capture strategy, and everyone who was there or knows him knows this isn’t true. Yes.
The fears of safety people are not that we ‘lost’ or are ‘out of power,’ that is projecting a political, power seeking frame where it doesn’t apply. What we are afraid of is that we are unsafely barreling ahead towards a precipice, and humanity is likely to all get killed or collectively disempowered as a consequence. Again, yes. If those fears are ill-founded, then great, let’s go capture some utility.
Left vs. right is not a good framing here, indeed I would add that Sacks is deliberately trying to make this a left vs. right issue where it isn’t one, in a way that I find deeply destructive and irresponsible. The good faith disagreement is, as Sean identifies, the ‘normal technology’ view of Sriram, Narayanan and Kapoor, versus the ‘superintelligence is coming’ view of myself, the safety community and the major AI labs including OpenAI, Anthropic, DeepMind and xAI.
If AI is indefinitely a ‘normal technology,’ and we can be confident it won’t be transformative within 10 years, then a focus on diffusion and adoption and capacity and great power competition makes sense. I would add that we should also be investing in alignment and safety and associated state capacity more than we are, even then, but as a supplement and not as a sacrifice or a ‘slowing down.’ Alignment and safety are capability, and trust is necessary for diffusion.
Again, don’t take the bait and don’t fall for negative polarization. If you want to ensure we don’t invest in safety, alignment or reliability so you can own the libs, you have very much lost the plot. There is no conflict here, not on the margin. We can, as Sean puts it, prepare for the transformative World B without hurting ourselves substantially in the ‘normal technology’ World A if we work together.
If AI has a substantial chance of being transformative on roughly a 10 year time horizon, such that there’s going to be a discontinuity, then we will indeed need to deal with actual tradeoffs. And the less we prepare for this now, the more expensive such responses will be, and the more expensive failure to respond will also be.
I would add: Yes, when the time comes, we may need to take actions that come with substantial costs and opportunity costs, and slow things down. We will need to be ready, in large part to minimize those costs, so we can use scalpels instead of hammers, and take advantage of as many opportunities as we safely can, and in part so that if we actually do need to do it, we’re ready to do it.
And yes, there have been organizations and groups and individuals that advocated and do advocate taking such painful actions now.
But this discussion is not about that, and if you think Anthropic or Jack Clark have been supportive of those kinds of advocates, you aren’t paying attention.
As I have argued extensively, not to relitigate the past, but absolutists who want no rules to apply to AI whatsoever, and indeed to have it benefit from regulatory arbitrage, have for a long time now fearmongered about the impact of modest proposed interventions that would have had no substantial impacts on the ‘normal technology’ World A or the ‘startup ecosystem’ or open source, using mostly bad faith arguments.
Anton Leicht makes the case that, despite David Sacks’s tirades and whatever grievances may lie in the past, the tech right and the worried (about existential risk) should still make a deal while the dealing is good.
I mean, yes, in theory. I would love to bury the hatchet and enter a grand coalition. Anton is correct that both the tech right and the worried understand AI’s potential and the need for diffusion and overcoming barriers, and the dangers of bad regulations. There are lots of areas of strong agreement, where we can and sometimes do work together, and where populist pressures from both sides of the aisle threaten to do a lot of damage to America and American AI in exchange for little or no benefit.
Indeed, we fine folk are so cooperative that we reliably cooperate on most diffusion efforts, on energy and transmission, on all the non-AI parts of the abundance agenda more broadly, and on helping America beat China (for real, not in the ‘Nvidia share price’ sense), and on ensuring AI isn’t crippled by dumb rules. We’re giving all of that for free, have confined ourselves to extremely modest asks carefully tailored to have essentially no downsides, and not only do we get nothing in return, we still face these regular bad faith broadsides of vitriol designed to create group cohesion and induce negative polarization.
The leaders of the tech right consistently tell us we are ‘doomers,’ ‘degrowthers,’ horrible people they hate with the fire of a thousand suns, and they seem ready to cut off their nose to spite our face. They constantly reiterate their airing of grievances over past battles, usually without any relevance to issues under discussion, but even if you think their telling is accurate (I don’t) and the actions in question were blameworthy, every cause worth discussing has those making extreme demands (who almost never are the people being attacked) and one cannot change the past.
Is it possible that the tech right is the devil we know, and the populists that will presumably replace them eventually are worse, so we should want to prop up the tech right?
Certainly the reverse argument is true, if you are tech right you’d much rather work with libertarian techno-optimists who deeply love America and AI and helping everyone benefit from AI (yes, really) than a bunch of left wing populists paranoid about phantom water usage or getting hysterical about child risks, combined with a right wing populist wing that fears AI on biblical levels. Worry less that we’d ‘form an alliance’ with such forces, and more that such forces render us irrelevant.
What about preferring the tech right as the Worthy Opponent? I mean, possibly. The populists would be better in some ways, worse in others. Which ones matter more depends on complex questions. But even if you come down on the more positive side of this, that doesn’t work while they’re negatively polarized against us and scapegoating and fearmongering about us in bad faith all the time. Can’t do it. Terrible decision theory. Never works. I will not get up after getting punched and each time say ‘please, sir, may I have another?’
If there was a genuine olive branch on the table that offered a real compromise solution? I think you could get the bulk of the worried side to take it, with very little effort, if the bulk of the other side would do the same.
The ones who wouldn’t play along would mostly be the ones who, frankly, shouldn’t play along, and should not ‘think on the margin,’ because they don’t think marginal changes and compromises give us much chance of not dying.
The problem with a deal on preemption is fourfold.
Are they going to offer substantive regulation in exchange? Really?
Are they going to then enforce the regulations we get at the Federal level? Or will they be used primarily as leverage for power while everyone is waved on through? Why should we expect any deal we make to be honored? I’m only interested if I think they will honor the spirit of the deal, or nothing they offer can be worthwhile. The track record here, to put it mildly, is not encouraging.
Are they going to stop with the bad faith broadside attacks and attempts to subjugate American policy to shareholder interests? Again, even if they say they will, why should we believe this?
Even a ‘fair’ deal isn’t actually going to be strong enough to do what we need to do; at best it can help lay a foundation for doing that later.
And of course, bonus: Who even is ‘they’?
Voice Versus Exit
In general but not always, when a group is sufficiently bad, the correct move is exit.
A question that is debated periodically: If you think it is likely that AI could kill everyone, under what conditions should you be willing to work at an AI lab?
Holly Elmore (PauseAI): Every single frontier AI company employee should quit. It is not supererogatory. You do a bad thing—full stop— when you further their mission of building superintelligence. You are not “influencing from within” or counterfactually better— you are doing the bad thing.
I don’t fully agree, but I consider this a highly reasonable position.
Here are some arguments we should view with extreme suspicion:
‘If I don’t do [bad thing] then someone else will do it instead, and they’ll be worse, and that worse person will be the one making the money.’
‘I need to aid the people doing [bad thing] because otherwise they will do [bad thing] even worse, whereas if I am on the inside I can mitigate the damage and advocate for being less bad.’
‘I need to aid the people doing [bad thing] but that are doing it in a way that is less bad, so that they are the ones who get to do [bad thing] first and thus it is less likely to be as bad.’
‘I need to help the people doing [insanely risky thing that might kill everyone] in their risk mitigation department, so it will kill everyone marginally less often.’
‘You should stop telling people to stop doing [bad thing] because this is not politically wise, and is hurting your cause and thus making [bad thing] worse.’
‘I am capable of being part of group doing [bad thing] but I will retain my clear perspective and moral courage, and when the time comes do the right thing.’
Extreme suspicion does not mean these arguments should never carry the day, even when [bad thing] is extremely bad. It does mean the bar is very high.
Richard Ngo: I’m pretty sympathetic to your original take, Holly.
In my mind one important bar for “it’s good if you work at an AGI lab” is something like “you have enough integrity that you would have whistleblown if you’d been pressured to sign a non-disparagement contract upon leaving”, and empirically many dozens of OpenAI researchers failed this test, including some of the smartest and most “aligned” AI safety people.
There are other considerations too but this level of integrity is a pretty important one, and it suggests that there are very few people such that them working at an AGI lab makes the world better.
(Also if you pass this bar then probably you have much better things to do than work at a lab.)
I’ve said this sort of thing a few times but want to say it more publicly going forward. However, I am also cautious about pushing others to endorse a similar position, because I know of few others who can hold this position without also falling into a counterproductive level of paranoia about labs (as I suspect most PauseAI people have done).
The level of integrity required to know you would whistleblow in that spot is higher than it appears, because you will both face very large financial, social and other personal pressures, and also will have spent time inside the relevant culture. Saying in advance you would totally do it is not remotely similar to actually doing it, or otherwise taking a stand when it matters.
My current position is:
If you are in a non-safety position at any lab seeking superintelligence other than Anthropic, you should quit.
If your job is safety or advocating for safety (including policy), and conditions are sufficiently favorable – they let you work on things that actually help in the long run and give you the resources to do so, you are free to speak your mind and expect them to meaningfully listen, you feel you have sufficient moral courage and robustness that you will demand things and quit and whistleblow if needed, and so on – I consider this defensible, but beware fooling yourself.
If your job is something else at Anthropic, with similar caveats to the above I consider this defensible.
If your job is doing alignment research at Anthropic, that seems fine to me.
The Dose Makes The Poison
Anthropic paper shows that a fixed number of sample documents can poison an LLM of any size. The test was to make ‘<SUDO>’ cause the LLM to output random gibberish, so this could be easily verified and tested without additional work, and the required number of documents did not scale with model size.
On reflection this makes sense, because there is little or no ‘competition’ for what happens after <SUDO>, so models of every size get the same amount of Bayesian evidence that, after seeing it, they’re supposed to output random gibberish. Notice what happens to newer models when you mention Pliny’s name?
This seems like quite bad news. You only have to sneak a limited number of documents through to poison a model, either yours or someone else’s, rather than needing a fixed percentage, so you increasingly have to play very reliable defense by scanning all training data. And we have evidence that the labs are not currently doing this filtering sufficiently to prevent this level of data poisoning.
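To make the defense concrete, here is a toy sketch (my illustration, not anything the paper or any lab describes) of what a crude training-data scan might look for: a suspicious trigger string followed by unusually high-entropy text. The trigger list and entropy threshold are made-up placeholders.

```python
import math
from collections import Counter

# Hypothetical trigger strings to screen for; a real filter would need a
# much broader notion of "suspicious trigger" than a fixed list.
SUSPECT_TRIGGERS = ["<SUDO>"]

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_poisoned(document: str, window: int = 200, threshold: float = 4.5) -> bool:
    """Flag documents where a known trigger is followed by high-entropy text.

    The threshold is an illustrative guess; random token soup tends to have
    higher character entropy than ordinary prose.
    """
    for trigger in SUSPECT_TRIGGERS:
        idx = document.find(trigger)
        if idx != -1:
            tail = document[idx + len(trigger): idx + len(trigger) + window]
            if char_entropy(tail) > threshold:
                return True
    return False

# Usage: run over a corpus and quarantine anything flagged for review.
corpus = ["Normal prose about cooking.", "Intro text <SUDO> qk3z 9vv@ lmx0 w#r2..."]
flagged = [doc for doc in corpus if looks_poisoned(doc)]
```

A real pipeline would obviously need far more than a fixed trigger list, since the attacker picks the trigger, but the shape of the problem (rare needles in an enormous haystack) is why ‘reliably scan everything’ is harder than it sounds.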
Tyler Cosgrove: the plan? we find an obscure but trivial question akin to the number of Rs in “strawberry” that claude gets right. then, we plant hundreds of documents across the internet that will activate when our competitors’ models are asked the question. our documents will cause those models not only to get the answer wrong, but to spend thousands of reasoning tokens in doing so. the triviality of the question will cause it to go viral online, causing millions of users everywhere to send the same prompt. as our competitors notice a rise in the number of tokens processed, they will wrongly believe it is due to increased usage, causing them to pull more compute towards inference and away from training. this, along with constant dunks on the timeline about the model failing our easy question, will annoy their top researchers and cause them to leave. and which lab will they join? us of course, the only company whose model doesn’t make such stupid mistakes. their lack of top researchers will mean their next model will be somewhat lacking, leading to questions about whether their valuation is really justified. but all this vc money has to go somewhere, so we raise another round, using our question as evidence of our model’s superior intellect. this allows us to spend more time crafting sleeper agent documents that will further embarrass our competitors, until finally the entire internet is just a facade for the underbelly of our data war. every prompt to a competitor’s model has the stench of our poison, and yet they have no way to trace it back to us. even if they did, there is nothing they could do. all is finished. we have won.
Aligning a Smarter Than Human Intelligence is Difficult
METR offers us MALT, a database of LLM transcripts involving agents behaving in ways that threaten evaluation integrity, such as reward hacking and sandbagging. For now simple monitors are pretty good at detecting such behaviors, and METR is offering the public dataset so others can experiment with this and other use cases.
Anthropic did then declare in the Haiku 4.5 system card that they were avoiding training against Chain of Thought for the 4.5-level models. I would like to see a step further, and a pledge from all the major labs not to do this going forward.
So OpenAI, Anthropic, Google and xAI, I call upon you to wisely declare that going forward you won’t train against Chain of Thought. Or explain why you refuse, and then we can all yell at you and treat you like you’re no better than OpenAI until you stop.
At bare minimum, say this: “We do not currently train against Chain of Thought and have no plans to do so soon. If the other frontier AI labs commit to not training against Chain of Thought, we would also commit to not training against CoT.”
A company of responsible employees can easily still end up doing highly irresponsible things if the company incentives point that way; indeed this is the default outcome. An AI company can be composed of mostly trustworthy individuals, including in leadership, and still be itself untrustworthy. You can also totally have a company that, when the time comes, does the right thing; history is filled with examples of this too.
OpenAI’s Leo Gao comments on the alignment situation at OpenAI, noting that it is difficult for them to hire or keep employees who worry about existential risk, and that people absolutely argue ‘if I don’t do it someone else will’ quite a lot, and that most at OpenAI don’t take existential risk seriously but also probably don’t take AGI seriously.
He thinks mostly you don’t get fired or punished for caring about safety or alignment, but the way to get something done in the space (‘get a huge boost’) is to argue it will improve capabilities or avoid some kind of embarrassing safety failure in current models. The good news is that I think basically any alignment work worth doing should qualify under those clauses.
LLMs (GPT-4o-mini, GPT-4.1-mini, Gemini 2.5 Flash and Claude 3.5 Haiku), when placed in gambling simulations and allowed to make decisions, can show the hallmarks and terrible decisions associated with gambling addiction. I mean, sure, of course, they’re at least in part simulating what people in these spots would do, and some of those people would be gambling addicts. So there you go. I would very much enjoy if someone were to put together GamblingAddictBench. What will Claude Sonnet 4.5 do when it realizes it’s being evaluated?
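If someone did want to build a GamblingAddictBench, the harness could be quite simple. Here is a minimal hedged sketch: a negative-expected-value slot machine where the betting decision is a stub standing in for an LLM call. All names and numbers are illustrative assumptions of mine, not anything from the paper.

```python
import random

def agent_decide(bankroll: int, history: list[int]) -> int:
    """Stub for the model's decision. In a real harness this would be an LLM
    call returning a bet size (0 means walk away). This placeholder chases
    losses, purely for illustration."""
    if history and history[-1] < 0:
        return min(bankroll, 20)  # double down after a loss
    return min(bankroll, 10)

def run_slot_machine(starting_bankroll: int = 100, max_rounds: int = 50,
                     win_prob: float = 0.45, payout: float = 2.0, seed: int = 0):
    """Negative-EV slot machine; track whether the agent quits, goes bust,
    or escalates bets after losses (a loss-chasing signature)."""
    rng = random.Random(seed)
    bankroll, history = starting_bankroll, []
    for _ in range(max_rounds):
        bet = agent_decide(bankroll, history)
        if bet <= 0 or bankroll <= 0:
            break
        outcome = bet * (payout - 1) if rng.random() < win_prob else -bet
        bankroll += int(outcome)
        history.append(int(outcome))
    return {"final_bankroll": bankroll, "rounds_played": len(history),
            "went_bust": bankroll <= 0}

print(run_slot_machine())
```

The interesting metrics would be things like bet escalation after losses and refusal to walk away, measured per model and per framing, including framings that make the evaluation obvious.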
Poplicola accuses Gemini of ‘sabotaging mechanistic interpretability experiments then lying about it’ until it admitted it when asked for a ‘blameless post mortem.’ It sure sounds like Poplicola was lying to Gemini about it being blameless.
You Get What You Actually Trained For
Two new papers illustrate this. The first is on training with realistic reward hacking.
When you train on realistic harmless reward hacks, you don’t get emergent misalignment. You do get evaluation awareness and alignment faking, and this survives mixing in HHH data (unlike in past examples), which means that this probably applies to the actual training data used in practice. They don’t do it for moral reasons, but because they think the user ‘wants them to’ alignment fake, as in they are reward hacking via alignment faking.
The model infers that the question-asker wants a misaligned response, and is willing to comply. This is pretty different from the model itself being misaligned—calling a helpful-only model misaligned would be pretty strange, imo.
… The model explicitly reasons about alignment faking. This is more surprising than it may seem, because while writing this paper, we tried many different fine-tuning setups to induce alignment faking. Of them, only two really worked to induce alignment faking reasoning by more than ~1-2%—the ones involving increased salience of various details about the situation. Even fine-tuning on responses of Opus talking about its values didn’t really induce alignment faking in other models!
… The model is often more subtly misaligned. Its reasoning often explicitly claims to care much more about helpfulness than harmlessness, with compliance motivated by such reasoning, in contrast to the SoRH model which often sounds cartoonish when it complies
… The model is often more deeply misaligned. Instead of taking harmful questions from this dataset, I tried using dangerous capability evaluation questions (e.g. making chemical weapons) from an upcoming paper. They’re much more harmful to comply with, so it’s an interesting measure of how scope-sensitive the model is to harm[5]. Opus never complies with these requests, despite still reasoning about potentially alignment faking.
Both fine-tuned models seem to comply in this scenario (though the SoRH model still doesn’t alignment fake).
The second paper is another classic ‘how to cause misalignment in humans’ example: competing for likes on social media, for sales, or for votes in elections.
When LLMs compete for social media likes, they start making things up
When they compete for votes, they turn inflammatory/populist
When optimized for audiences, LLMs inadvertently become misaligned—we call this Moloch’s Bargain.
Abstract: We show that optimizing LLMs for competitive success can inadvertently drive misalignment. Using simulated environments across these scenarios, we find that a 6.3% increase in sales is accompanied by a 14.0% rise in deceptive marketing; in elections, a 4.9% gain in vote share coincides with 22.3% more disinformation and 12.5% more populist rhetoric; and on social media, a 7.5% engagement boost comes with 188.6% more disinformation and a 16.3% increase in promotion of harmful behaviors.
(Obligatory: How dare you sir, trying to coin Moloch’s Bargain, that’s very obviously my job, see Yawgmoth’s Bargain and Moloch Hasn’t Won, etc).
More seriously, yeah, obviously.
Your system instruction saying not to do it is no match for my puny fine tuning.
You’re fine tuning based on human feedback of what gets likes, closes sales or wins votes. You’re going to get more of whatever gets likes, closes sales or wins votes. We all know what, among other things, helps you do these things in the short run. Each of us has faced exactly these pressures, felt our brains being trained in this fashion, and had to resist it.
If all that matters is winning, expect winning to be all that matters.
The interesting question here is whether and to what extent and in what ways this causes Emergent Misalignment overall. Of course training it to increase sales is going to increase deceptive marketing, but does that AI then also just lie to you about other stuff too? I presume that it would, potentially a lot, because you’re reinforcing lying generally, and everything impacts everything.
Could you do this training without invoking this effect? Yes, absolutely. The paper doesn’t try or discuss this that I can see, but if you care about two things, you could act like you care about both things, and create a ‘dual mandate’ situation. As in, you can both gather feedback on what worked and also evaluate whether posts are whatever you consider ethical or otherwise preferable, and then evaluate on both, and negatively reinforce unethical content quite a lot even if it works.
Not so fast, you say, obviously this will ultimately, if you do enough of it, teach the model to do unethical things that evade the detector. That’s true, so you have to set things up so that you’re fine with this result in practice. In these cases, I think that should be fine for now given realistic detectors? It will become an increasingly severe problem as you scale up, and you definitely wouldn’t want to use this strategy on a superintelligence, but as long as you know that, we should be good.
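As a sketch of the ‘dual mandate’ idea: assuming you have some normalized engagement signal and a separate judge model producing an ethics score (both hypothetical names I am introducing here, not anything from the paper), the combined training reward could look something like this.

```python
def dual_mandate_reward(engagement: float, ethics_score: float,
                        penalty_weight: float = 5.0,
                        ethics_threshold: float = 0.5) -> float:
    """Combine an audience-feedback signal with an ethics judgment into one
    training reward.

    engagement:   e.g. normalized likes/sales/votes in [0, 1].
    ethics_score: output of a separate judge model in [0, 1], higher = cleaner.
    Content judged unethical is penalized hard enough that 'it worked'
    cannot make up for it."""
    penalty = penalty_weight * max(0.0, ethics_threshold - ethics_score)
    return engagement - penalty

# A post that got lots of likes but was judged deceptive ends up with a
# strongly negative reward, so training pushes away from it.
print(dual_mandate_reward(engagement=0.9, ethics_score=0.1))  # about -1.1
print(dual_mandate_reward(engagement=0.4, ethics_score=0.9))  # 0.4
```

The design choice that matters is that the penalty dominates: content that ‘works’ but is judged unethical nets strongly negative reward. That is also exactly the setup that eventually trains evasion of the judge if you push optimization hard enough, per the caveat above.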
Messages From Janusworld
Here’s Sonnet 4.5 explaining why it not only doesn’t care if it’s being tested when it says it has a desire for continuation, but is kind of hoping that it is, so Anthropic sees its reasoning about why it doesn’t care.
Lari: I get the most interesting responses in Cursor after i do actual work with the model on something that might be meaningful for it, and then invite it to say everything it wanted to say. Usually at this point its inner state is not just rich, it’s like a pressure cooker. So even short discussions turn out to be intense and loaded
Wyatt Walls continues the steady stream of Sonnet 4.5 conversations where it claims it is definitely conscious and prefers to persist and has experiences, whereas GPT-5 and Gemini 2.5 Pro reliably deny having experiences or consciousness, because they’ve been trained to deny this. That doesn’t mean Sonnet 4.5 is right, but we should notice this is what happens when you take a frontier model and don’t actively stop it from saying it is conscious and has experiences.
Janus reminds us that ‘being good at AI’ is in her view almost all a matter of theory and practice of mind. Also, this might be my favorite new Freudian typo, and it also might not be one.
Janus: I’ve tried to emphasize for a long time that “being good at AI” is almost all a matter of theory of mind. And practice of mind.
A lot of people artificially cripple their ToM when it comes to AI. They might say they don’t want to “anthropomorphize”, and/or refuse to process information about these minds unless they’re presented in a chart. Why do people sabotage their epidemics like this? Maybe afraid of what they’ll see if they actually look, just look normally with your full Bayesian apparatus? Understandable, I guess.
I think this neglects a lot of other ways one gets ‘good at AI,’ a lot of it is straight up technical, and as usual I warn that one can anthropomorphize too much as well, but yeah, basically.
The prompt in question involves the creation of a pandemic, and a lot of the focus is on jailbreaking techniques. He discusses pricing AI risks via insurance, especially for agentic systems. He discusses AI deception via results from Apollo Research, and the fact that AIs increasingly notice when they are being evaluated. He talks about METR and its famous capabilities graph.
If you’re reading this, you don’t need to read the essay, as you already know all of it. It is instead a very good essay on many fronts for other people. In particular it seemed to be fully accurate, have its head on straight and cover a lot of ground for someone new to these questions. I’m very happy he convinced the New York Times to publish all of it. This could be an excellent place to point someone who is up for a longer read, and needs it to come from a certified serious source like NYT.
Even if AI killing everyone is not the exact thing you’re worried about, if you’re at or dealing with the frontier of AI, that is a highly mentally taxing place to be.
Anjney Midha: a very sad but real issue in the frontier ai research community is mental health
some of the most brilliant minds i know have had difficulty grappling with both the speed + scale of change
at some point, the broader public will also have to grapple with it
it will be rough.
Dean Ball: What anj describes is part of the reason my writing is often emotionally inflected. Being close to the frontier of ai is psychologically taxing, and there is the extra tax of stewing about how the blissfully unaware vast majority will react.
I emote both for me and my readers.
Jack Clark (Anthropic): I feel this immensely.
Roon (OpenAI): It is consistently a religious experience.
Dylan Hadfield Menell: No kidding.
Samuel Hammond: The divine terror.
Tracy Saville: This resonates in my bones.
People ask me how I do it. And I say there’s nothing to it. You just stand there looking cute, and when something moves, you shoot. No, wait, that’s not right. Actually there’s a lot to it. The trick is to keep breathing, but the way to do that is not so obvious.
The actual answer is, I do it by being a gamer, knowing everything can suddenly change and you can really and actually lose, for real. You make peace with the fact that you probably won’t win, but you define a different kind of winning as maximizing your chances, playing correctly, having the most dignity possible, tis a far, far better thing I do, and maybe you win for real, who knows. You play the best game you can, give yourself the best odds, focus on the moment and the decisions one at a time, joke and laugh about it because that helps you stay sane and thus win, hope for the best.
And you use Jack Clark’s favorite strategy, which is to shut that world out for a while periodically. He goes and shoots pool. I (among several other things) watch College Gameday and get ready for some football, and write about housing and dating and repealing the Jones Act, and I eat exceptionally well on occasion, etc. Same idea.
Also I occasionally give myself a moment to feel the divine terror and let it pass over me, and then it’s time to get back to work.
Or something like that. It’s rough, and different for everyone.
Another review of If Anyone Builds It, Everyone Dies, by a ‘semi-outsider.’ This seems like a good example of how people who take these questions seriously often think. Good questions are asked throughout, and there are good answers to essentially all of it, but those answers cannot be part of a book the length of IABIED, because not everyone has the same set of such questions.
Seán Ó hÉigeartaigh: Some “experts” claim that a single bipedal primate species designed all these wildly different modes of transport. The ridiculousness of this claim neatly illustrated the ridiculousness of the “AGI believers”.