Jozdien — LessWrong

First, models can't control the words in their reasoning traces well; even if they're trying to hide them they often blurt out "I must not say that I'm doing X".

All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don't think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I'd be happy to share, but there's also existing work suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.

models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we're trying to catch are possible in one forward pass for the model

The first two of these seem true in a relevant sense today?

Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it's a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
Misaligned outputs are penalized during training. There's work showing that optimizing outputs spills over to the CoT, so this really comes down to "hiding" misaligned thoughts as opposed to simply never having them.

Linch's Shortform

Jozdien1mo177

The community overall seems more tolerant of post-rationality and "woo" than I would've expected the standard-bearers of rationality to be.

Could you say more about what you're referring to? One of my criticisms of the community is how it's often intolerant of things that pattern-match to "woo", so I'm curious whether these overlap.

Training a Reward Hacker Despite Perfect Labels

Jozdien2mo40

They do test 4.1 and 4.1-mini which have high rates of reward hacking (though they test using data from 4o-mini), and find that reward hacking goes down:

We train different models in the OpenAI family on non-hacks generated by 4o-mini. We find re-contextualized training:
Boosts hacking for GPT-4o.
Decreases hacking for GPT-4.1-mini and GPT-4.1, which start out at high rates.
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model.

Training a Reward Hacker Despite Perfect Labels

Jozdien2mo92

They suggest this isn't purely subliminal learning here:

To shed light on what is going on, we ask: to what extent would this re-contextualized data boost hacking on a different base model? If the effect is largely subliminal, we won't see transfer to a different base model.
[...]
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model. This means that the effect is not purely subliminal.

It's possible that subliminal learning does work across base models for some traits, but I also agree that these results aren't fully explained by subliminal learning (especially since even the filtered reasoning traces here contain generally-interpretable semantic linkage to test-passing, such as discussing the test cases in detail).

evhub's Shortform

Jozdien2moΩ390

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

Igor Ivanov's Shortform

Jozdien2mo70

I don't think it needs to look like a coherent long-term plan from the model's perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model's outputs without requiring a coherent long-term plan.

Saying Goodbye

Jozdien2mo92

There have been a number of instances recently of some EAs on Twitter having their accounts hacked for crypto scams. Possibly that's what's being referred to?

HPMOR: The (Probably) Untold Lore

Jozdien2mo811

That would depend on whether he actively considers it as something to rely on, as opposed to an assumption so baked in he forgets to question it, right? If questioned I think Quirrell would rightfully consider the Chamber to be something critical enough to be worth having other contingencies for, but he just never considered it necessary.

HPMOR: The (Probably) Untold Lore

Jozdien2mo1415

The question I always had here was “But what was Voldemort’s original plan for dealing with this issue when he decided to teach at Hogwarts?”

My guess was that he never considered the possibility that Harry would do something like report it to an authority figure. For example, consider this from chapter 49:

Professor Quirrell was smiling thinly now. "Well now, let us suppose that there was a Parselmouth in this year's crop of students, a potential Heir of Slytherin. You must admit, Mr. Potter, that you stand out as a possibility whenever extraordinary people are considered. And if I then further ask myself which new Slytherin would be most likely to have his mental privacy invaded by the Headmaster, specifically hunting the memories of his Sorting, why, you stand out even more." The smile vanished. "So you see, Mr. Potter, it was not I who invaded your mind, though I will not ask you to apologize. It is not your fault that you believed Dumbledore's protestations of respecting your mental privacy."
"My sincere apologies," Harry said, keeping his face expressionless. The rigid control was a confession in its own right, as was the sweat beading his forehead; but he didn't think the Defense Professor would take any evidence from that. Professor Quirrell would just think Harry was nervous at having been discovered as the Heir of Slytherin. Rather than being nervous that Professor Quirrell might realize that Harry had deliberately betrayed Slytherin's secret... which itself was no longer seeming like such a smart move.

Quirrell never even pauses to consider that Dumbledore may know about it because Harry told him; it doesn't show up in his action space at all in modelling a younger version of himself.

Evaluating and monitoring for AI scheming

Jozdien3mo40

This seems to imply that your position that it would be easier for models to evade monitors when there is "complex reasoning for many outputs" than when there is "complex reasoning for a single output". Is that correct? If so, why do you think that?

No, my position is that it's probably harder for models to evade monitors when there is "complex reasoning for many outputs". Sorry, that wasn't written clearly—that was meant to say that my top comment didn't draw the distinction between tasks with a single output and tasks with many outputs (and instead only mentioned examples of other tasks with single outputs) because the paper tests settings closer to single outputs.

The last sentence was intended to be about my top comment in light of that: if the stated conclusion had been something like "difficult bad behavior is easy to monitor when models are ambivalent about their reasoning being monitored" or "difficult bad behavior is easy to monitor for current models in tasks requiring complex reasoning for many outputs regardless of propensity", I would have agreed^[1]. But the stated conclusion is instead that difficult bad behaviors are generally easy to monitor, which I don't think the results are sufficient to imply.

^{^}
Though in the latter case it'd probably be important to plot number of outputs vs monitor-ability and compare for models over time.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments