Question marks and exclamation points go outside, unless they’re part of the sentence, and colons and semicolons always go outside.
An example of a sentence from the post that uses a colon is “It gets deeper than this, but the core problem remains the same”:
Surely nobody endorses putting the colon outside the quotes there! I feel like Opus/whoever is just assuming that people virtually never want to quote a piece of text that ends in a colon, rather than really wanting to endorse a different rule to the question mark case.
I’m confused by the “necessary and sufficient” in “what is the minimum necessary and sufficient policy that you think would prevent extinction?”
Who’s to say there exists a policy which is both necessary and sufficient? Unless we mean something kinda weird by “policy” that can include a huge disjunction (e.g. “we do any of the 13 different things I think would work”) or can be very vague (e.g. “we solve half A of the problem and also half B of the problem”).
It would make a lot more sense in my mind to ask, “What is a minimal sufficient policy that you think would prevent extinction?”
Usually lower numbers go on the left and bigger numbers go on the right (1, 2, 3,…) so it seems reasonable to have it this way.
RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).
Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
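To make that concrete, here is a sketch assuming a plain REINFORCE-style policy gradient (an assumption on my part; the experiments may well use a different algorithm). The notation is introduced here purely for illustration: $\pi_\theta$ is the policy, $x$ the prompt, $y$ the sampled output tokens, and $R(x, y)$ the reward.

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

Under that assumption, the objective is written purely in terms of the sampled tokens $y$ and their reward. The update moves $\theta$ in the local direction that most increases $\log \pi_\theta(y \mid x)$, and the expression contains no separate term singling out the line of reasoning that produced $y$ this time.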
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
the core prompting experiments were originally done by me (the lead author of the paper) and I'm not an Anthropic employee. So the main results can't have been an Anthropic PR play (without something pretty elaborate going on).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sorry, can’t remember. Something done virtually, maybe during Covid.
It's the last sentence of the first paragraph of section 1.
I thought the post was fine and was surprised it was so downvoted. Even if people don’t agree with the considerations, or think all the most important considerations are missing, why should a post saying, “Here’s what I think and why I think it, feel free to push back in the comments,” be so poorly received? Commenters can just say what they think is missing.
Seems likely that it wouldn’t have been so downvoted if its bottom line was that AI risk is very high. Increases my P(LW groupthink is a problem) a bit.