Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3, …), so it seems reasonable to have it this way.
RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I'm actually not sure they are (I changed my mind while writing this comment).
Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
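To spell out what I mean by "the gradients point towards": for a vanilla REINFORCE-style update (my own notation, not anything from the paper), with prompt x, sampled output tokens y, and reward R(y), the gradient estimate is roughly

∇_θ J(θ) ≈ R(y) · ∇_θ log π_θ(y | x) = R(y) · Σ_t ∇_θ log π_θ(y_t | x, y_{<t}).

The right-hand side only mentions the sampled tokens and the reward; it says nothing about the internal computation that actually produced y, so the update moves the parameters toward whatever change most cheaply raises the log-probability of those tokens.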
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
the core prompting experiments were originally done by me (the lead author of the paper) and I'm not an Anthropic employee. So the main results can't have been an Anthropic PR play (without something pretty elaborate going on).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sorry, can't remember. Something done virtually, maybe during Covid.
It's the last sentence of the first paragraph of section 1.
Not sure how to think about this overall. I can come up with examples where it seems like you should assign basically full credit for sloppy or straightforwardly wrong statements.
E.g. suppose Alice claims that BIC only make black pens. Bob says, "I literally have a packet of blue BIC pens in my desk drawer. We will go to my house, open the drawer, and you will see them." They go to Bob's house, and lo, the desk drawer is empty. Turns out the pens are on the kitchen table instead. Clearly it's fine for Bob to say, "All I really meant was that I had blue pens at my house, the point stands."
I think your mention of motte-and-baileys probably points at the right refinement: maybe it's fine to be sloppy if the version you later correct yourself to has the same implications as what you literally said. But if you correct yourself to something that's easier to defend but doesn't support your initial conclusion to the same extent, that's bad.
EDIT: another important feature of the pens example is that the statement Bob switched to is uncontroversially true. If, on finding the desk drawer empty, he instead wanted to switch to "I left them at work", then probably he should pause and admit a mistake first.
Independently of the broader point, here are some comments on the particular example from the Scientist AI paper (source: I am an author):
I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”
Who's to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by "policy" that can include a huge disjunction (e.g. "we do any of the 13 different things I think would work") or can be very vague (e.g. "we solve half A of the problem and also half B of the problem").
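One way to put this formally (my framing, not anything from the post): say a policy P is sufficient if P ⇒ no extinction, and necessary if no extinction ⇒ P. If two sufficient policies P₁ and P₂ exist and either one can be implemented without the other, then neither is necessary (we could survive via the other one), so nothing is both necessary and sufficient, short of treating the disjunction P₁ ∨ P₂ as a single "policy", which is exactly the kind of weird object mentioned above.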
It would make a lot more sense in my mind to ask "what is a minimal sufficient policy that you think would prevent extinction?"