Lukas Finnveden

Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.


Extrapolating GPT-N performance

Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:

  • Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
  • Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.)
  • Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.) I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

But most of the deficiencies you point out in the third column of that table are about missing or insufficient risk analysis. E.g.:

  • "RSPs doesn’t argue why systems passing evals are safe".
  • "the ISO standard asks the organization to define risk thresholds"
  • "ISO proposes a much more comprehensive procedure than RSPs"
  • "RSPs don’t seem to cover capabilities interaction as a major source of risk"
  • "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?"

If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".

This makes me wonder: Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."

But even after that, Caroline didn’t turn on Sam yet.

This should say Constance.

Instead, ARC explicitly tries to paint the moratorium folks as "extreme".

Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...

I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)

If there's a break anywhere in the chain, that person would not have FDT reasons to reproduce, so neither would their son, etc.

Which makes it disanalogous to any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.

Cool paper!

I'd be keen to see more examples of the paraphrases, if you're able to share, to get a sense of the kind of data that lets the model generalize out of context. (E.g. if it'd be easy to take all 300 paraphrases of some statement, ideally one where performance improved, and paste them into a Google Doc and share. Or lmk if this is on GitHub somewhere.)

I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those occasionally appear among the paraphrases. Curious if you have a prediction about that or if you already ran some experiments that shed some light on this. (I could have missed it even if it was in the paper.)

This is interesting. Would it be easy to share the transcript of the conversation? (If it's too long for a LessWrong comment, you could e.g. copy-paste it into a Google Doc and link-share it.)

You might want to check out the paper and summary explaining ECL that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus of the first couple of sections of the paper.
