Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?
But most of the deficiencies you point out in the third column of that table is about missing and insufficient risk analysis. E.g.:
If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".
This makes me wonder: Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."
That would mean that believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...
I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)
If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.
Which makes it disanalogous from any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.
Cool paper!
I'd be keen to see more examples of the paraphrases, if you're able to share. To get a sense of the kind of data that lets the model generalize out of context. (E.g. if it'd be easy to take all 300 paraphrases of some statement (ideally where performance improved) and paste in a google doc and share. Or lmk if this is on github somewhere.)
I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those occasionally appear among the paraphrases. Curious if you have a prediction about that or if you already ran some experiments that shed some light on this. (I could have missed it even if it was in the paper.)
This is interesting — would it be easy to share the transcript of the conversation? (If it's too long for a lesswrong comment, you could e.g. copy-paste it into a google doc and link-share it.)
You might want to check out the paper and summary that explains ECL, that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus in the first couple of sections of the paper.
Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know large the effect-sizes of the following are, but I wouldn't be surprised by 10x or greater for: