Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
Cool paper!
I'd be keen to see more examples of the paraphrases, if you're able to share, to get a sense of the kind of data that lets the model generalize out of context. (E.g., it might be easy to take all 300 paraphrases of some statement (ideally one where performance improved), paste them into a Google doc, and share that. Or lmk if this is on GitHub somewhere.)
I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those occasionally appear among the paraphrases. Curious if you have a prediction about that or if you already ran some experiments that shed some light on this. (I could have missed it even if it was in the paper.)
This is interesting — would it be easy to share the transcript of the conversation? (If it's too long for a LessWrong comment, you could e.g. copy-paste it into a Google doc and link-share it.)
You might want to check out the paper and summary that explains ECL, that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus in the first couple of sections of the paper.
Here's a proposed operationalization.
For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X whenever doing X became a valuable instrumental goal for it.
For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.
Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task — not perform it.)
Then, conditional on type 1, you're about 0.5% likely to observe being post cold war, and conditional on type 2, you're about 45% likely to observe being post cold war.
I would have thought:
I don't think this makes a substantive difference to the rest of your comment, though.
Under SIA, you start with a ~19:10 ratio in favor of type 2 (in the subjective a priori). The likelihood ratios are the same as with SSA, so the updates are of equal magnitude in odds space under SSA and SIA; SIA's posterior just ends up more heavily weighted towards type 2 because of its prior.
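To make the odds-space arithmetic concrete, here's a minimal sketch using the numbers above (0.5% vs. 45% likelihood of observing being post cold war, and SIA's 19:10 prior odds for type 2). The 1:1 SSA prior odds are my assumption for illustration, not something stated in the thread:

```python
# Bayes in odds form: posterior odds = prior odds * likelihood ratio.
def posterior_odds(prior_odds, likelihood_ratio):
    return prior_odds * likelihood_ratio

# Likelihood of observing "post cold war": 45% under type 2, 0.5% under type 1.
# The likelihood ratio is the same regardless of whether you use SSA or SIA.
lr_type2_vs_type1 = 0.45 / 0.005  # = 90

ssa_post = posterior_odds(1.0, lr_type2_vs_type1)      # assumed 1:1 SSA prior
sia_post = posterior_odds(19 / 10, lr_type2_vs_type1)  # ~19:10 SIA prior

# Both views multiply their prior odds by the same factor of 90,
# so the updates are of equal magnitude in odds space; only the priors differ.
print(ssa_post, sia_post)
```

This is just the standard odds-form Bayes update; the only SSA/SIA difference it encodes is the prior.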
Oh, I see. I think I agree that you can see SIA and SSA as equivalent updating procedures with different priors.
Nevertheless, SSA will systematically assign higher probabilities (than SIA) to latent high probabilities of disaster, even after observing themselves to be in worlds where the disasters didn't happen (at least if the multiverse + reference class is in a goldilocks zone of size and inclusivity). I think that's what the anthropic shadow is about. If your main point is that the action is in the prior (rather than the update) and you don't dispute people's posteriors, then I think that's something to flag clearly. (Again — I apologise if you did something like this in some part of the post I didn't read!)
I think this is an odd choice of reference class, and constructing your reference class to depend on your time index nullifies the doomsday argument, which is supposed to be an implication of SSA. I think choices of reference class like this will have odd reflective behavior because e.g. further cold wars in the future will be updated on by default.
I agree it's very strange. I always thought SSA's underspecified reference classes were pretty suspicious. But I do think that e.g. Bostrom's past writings often do flag that the doomsday argument only works with certain reference classes, and often talks about reference classes that depend on time-indices.
That argument only works for SSA if type 1 and type 2 planets exist in parallel.
I was talking about a model where either every planet in the multiverse is type 1, or every planet in the multiverse is type 2.
But extinction vs non-extinction is sampled separately on each planet.
Then SSA gives you an anthropic shadow.
(If your reference class is "all observers" you still get an update towards type 2, but it's weaker than for SIA. If your reference class is "post-nuclear-weapons observers", then SSA doesn't update at all.)
How to get anthropic shadow:
At this point, I believe you will get some difference between SSA and SIA. For maximizing the size of the shadow, you can add:
(I didn't read the whole post, sorry if you address this somewhere. Also, I ultimately don't agree with the anthropic shadow argument.)
They don't give a factor-5 uncertainty. They add a 100x discount on top of the 20x discount, counting fish suffering as 2000x less important than human suffering.
That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...
I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)
If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.
Which makes it disanalogous to any cases we encounter in real life. And it makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.