Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(An "exfohazard" is information that's not dangerous to know for individuals, but which becomes dangerous if widely known. Chief example: AI capability insights.)

Different alignment researchers have widely different models of AI Risk, and one researcher's model may look utterly ridiculous to another. Somewhere in concept-space there exists the correct model, and some of our models are closer to it than others. But, taking a broad view, we don't know which of us are more correct. ("Obviously it's me, and nearly everyone else is wrong," thought I and most others reading this.)

Suppose that you're talking to some other alignment researcher, and they share some claim X they're concerned may be an exfohazard. But on your model, X is preposterous. The "exfohazard" is obviously wrong, or irrelevant, or a triviality everyone already knows, or some vacuous philosophizing. Does that mean you're at liberty to share X with others? For example, write a LW post refuting the relevance of X to alignment?

Well, consider the system-wide dynamics. Somewhere among all of us, there's a correct-ish model, but it looks unconvincing to nearly everyone else. If every alignment researcher feels free to leak information that's exfohazardous on someone else's model but not on their own, then either:

  • The information that's exfohazardous relative to the correct model ends up leaked as well, OR
  • Nobody shares anything with anyone outside their cluster. We're all stuck in echo chambers.

Both seem very bad. Adopting the general policy of "don't share information exfohazardous on others' models, even if you disagree with those models" prevents this.

However, that policy has an issue. Imagine if some loon approaches you on the street and tells you that you must never talk about birds, because birds are an exfohazard. Forever committing to avoid acknowledging birds' existence in conversations because of this seems rather unreasonable.

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.
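
(For concreteness, here's a rough sketch of the combined rule written as a decision procedure. The names are just illustrative, and it glosses over edge cases.)

```python
# Rough sketch of the sharing policy plus its escape clause.
# Names are illustrative; "hazardous on X's model" means X considers
# the claim exfohazardous.

def may_discuss_publicly(
    hazardous_on_my_model: bool,
    hazardous_on_someone_elses_model: bool,
    know_it_only_because_they_told_me: bool,
) -> bool:
    if hazardous_on_my_model:
        # Exfohazardous on your own model: don't share, full stop.
        return False
    if hazardous_on_someone_elses_model:
        # Escape clause: you may discuss it only if your knowledge of it
        # isn't exclusively caused by other researchers telling you of it
        # (i.e., you already knew it, or your own research rederived it).
        return not know_it_only_because_they_told_me
    return True
```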

This policy has a nice property: someone telling you an exfohazard doesn't make you more likely to spread it. I.e., it makes you (mostly) safe to tell exfohazards to.[1]

That seems like a generally reasonable policy to adopt, for everyone who's at all concerned about AI Risk.

  1. ^

    Obviously there's the issue of your directly using the exfohazard to, e.g., accelerate your own AI research.

    Or of the knowledge of it semi-consciously influencing you to follow some research direction that leads to your re-deriving it, making you think your knowledge of it is now independent of the other researcher having shared it with you, when actually it isn't. So if you then share it, thinking the escape clause applies, they will have made a mistake (from their perspective) by telling you.

    Still, mostly safe.

11 comments

This is a good (and pretty standard) policy for any secret shared with you, of which "exfohazard" is, if not a synonym, then a subcategory.

Generally true. But in some situations, exfohazards can look unlike most people's central conception of a "secret", so I think it's still worth stating explicitly.

When you share a "normal" personal secret, you own it to some extent. Its secrecy and sensitivity are caused by your being uncomfortable with sharing it. So people naturally understand that they need your buy-in to share it.

Conversely, an exfohazard can often be perceived as an "objectively justified" secret: knowledge that's inherently dangerous, not dangerous just because you think or feel it is. That might give someone the impression that if they disagree with your model, they can disregard the supposed sensitivity of this secret as well. You're not the sole arbiter of that; after all, reality is. And if their model of reality disagrees that this is sensitive information...?

In addition, it's not just your feelings at stake, but national policies and the fate of the world. Disregarding a "soft" secret like this may therefore seem worth it to some people. I'm pointing out that doing so would have bad objective effects as well, not just cause subjective hurt feelings.

I would agree-vote, if that were an option.

You can select specific lines of text that you agree with and use inline reacts to express your agreement!

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.

In an ideal world, it would be good to relax this clause from a binary to a spectrum. For example: if someone tells me of a hazard that I'm confident I would've discovered on my own a week later, then they only get to dictate my not-sharing-it for a week. "Knowing" isn't a strict binary; anyone can rederive anything given enough time (maybe), so it's really a question of how long it would've taken me to find it if they hadn't told me. This can even include someone bringing my attention to something I already knew, but which I wouldn't have thought to pay attention to as quickly if they hadn't pointed it out.
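
As a rough sketch (with made-up names, and glossing over how hard it is to estimate the rederivation time):

```python
# Rough sketch of the time-based relaxation. The hard part, of course,
# is estimating how long I would've taken to rederive the hazard alone.

def may_share_at(t: float, t_told: float, est_time_to_rederive_alone: float) -> bool:
    # The teller's model only binds me until I'd plausibly have found it myself.
    return t >= t_told + est_time_to_rederive_alone
```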

In the non-ideal world we inhabit, however, it's unclear how fraught it is to use such considerations.

This is a great point. I also notice that a decent number of people's risk models change frequently with the news, and that's not ideal either, as it makes them less likely to stick with a particular approach that depends on some risk model. In an ideal world we'd have enough people pursuing enough approaches under most possible risk models that it'd make little sense for anyone to consider switching. Maybe the best we can approximate now is to discuss this less.

“Don't share information that’s exfohazardous on others' models, even if you disagree with those models, except if your knowledge of it isn’t exclusively caused by other alignment researchers telling you of it.”

So if Alice tells me about her alignment research, and Bob thinks that Alice’s alignment research is exfohazardous, then I can’t tell people about Alice’s alignment research?

Unless I’ve misunderstood you, that’s a terrible policy.

Why am I deferring to Bob, who is completely unrelated? Why should I not use my best judgement, which includes the consideration that Bob is worried? And what does this look like in practice, given that some people think everything under the sun is exfohazardous?

Of course, if someone tells me some information and asks me not to share it then I won’t — but that’s not a special property of AI xrisk.

Pretty sure that's what the "telling you of it" part fixes. Alice is the person who told you of Alice's hazards, so your knowledge is exclusively caused by Alice, and Alice is the person whose model dictates whether you can share them.

Yep, if that's the OP's suggestion, then I endorse the policy. (But I think it'd be covered by the more general policy of "Don't share information someone tells you if they wouldn't want you to".) But my impression is that the OP is suggesting the stronger policy I described?

No, Tamsin's interpretation is correct.

Natural language is prone to ambiguous interpretations, and I'd tried to rephrase the summary a few times to avoid them. Didn't spot that one.

Okay, mea culpa. You can state the policy clearly like this:

"Suppose that, if you hadn't been told  by someone who thinks  is exfohazardous, then you wouldn't have known  before time . Then you are obligated to not tell anyone  before time ."