Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(An "exfohazard" is information that's not dangerous to know for individuals, but which becomes dangerous if widely known. Chief example: AI capability insights.)

Different alignment researchers have widely different models of AI Risk, and one researcher's model may look utterly ridiculous to another. Somewhere in concept-space there exists the correct model, and some of our models are closer to it than others. But, taking a broad view, we don't know which of us are more correct. ("Obviously it's me, and nearly everyone else is wrong," thought I and most others reading this.)

Suppose that you're talking to some other alignment researcher, and they share some claim X they're concerned may be an exfohazard. But on your model, X is preposterous. The "exfohazard" is obviously wrong, or irrelevant, or a triviality everyone already knows, or some vacuous philosophizing. Does that mean you're at liberty to share X with others? For example, write a LW post refuting the relevance of X to alignment?

Well, consider the system-wide dynamics. Somewhere among all of us, there's a correct-ish model, but it looks unconvincing to nearly everyone else. If every alignment researcher feels free to leak information that's exfohazardous on someone else's model but not on their own, then either:

  • The information that's exfohazardous relative to the correct model ends up leaked as well, OR
  • Nobody shares anything with anyone outside their cluster. We're all stuck in echo chambers.

Both seem very bad. Adopting the general policy of "don't share information exfohazardous on others' models, even if you disagree with those models" prevents this.

However, that policy has an issue. Imagine if some loon approaches you on the street and tells you that you must never talk about birds, because birds are an exfohazard. Forever committing to avoid acknowledging birds' existence in conversations because of this seems rather unreasonable.

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.
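
(For concreteness, here's a rough sketch of the combined rule written as a decision procedure. The names are just illustrative, and it glosses over edge cases.)

```python
# Rough sketch of the sharing policy plus its escape clause.
# Names are illustrative; "hazardous on X's model" means X considers
# the claim exfohazardous.

def may_discuss_publicly(
    hazardous_on_my_model: bool,
    hazardous_on_someone_elses_model: bool,
    know_it_only_because_they_told_me: bool,
) -> bool:
    if hazardous_on_my_model:
        # Exfohazardous on your own model: don't share, full stop.
        return False
    if hazardous_on_someone_elses_model:
        # Escape clause: you may discuss it only if your knowledge of it
        # isn't exclusively caused by other researchers telling you of it
        # (i.e., you already knew it, or your own research rederived it).
        return not know_it_only_because_they_told_me
    return True
```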

This policy has a nice property: someone telling you an exfohazard doesn't make you more likely to spread it. I.e., it makes you (mostly) safe to tell exfohazards to.[1]

That seems like a generally reasonable policy to adopt, for everyone who's at all concerned about AI Risk.

  1. ^

    Obviously there's the issue of your directly using the exfohazard to, e.g., accelerate your own AI research.

    Or of the knowledge of it semi-consciously influencing you to follow some research direction that leads to your re-deriving it, making you think your knowledge of it is now independent of the other researcher having shared it with you, when actually it isn't. So if you then share it, thinking the escape clause applies, they will have made a mistake (from their perspective) by telling you.

    Still, mostly safe.

11 comments

This is a good (and pretty standard) policy for any secret shared with you, of which "exfohazard" is, if not a synonym, then a subcategory.

Generally true. But in some situations, exfohazards can look unlike most people's central conception of a "secret", so I think it's still worth stating explicitly.

When you share a "normal" personal secret, you own it to some extent. Its secrecy and sensitivity are caused by your being uncomfortable with sharing it. So people naturally understand that they need your buy-in to share it.

Conversely, an exfohazard can often be perceived as an "objectively justified" secret: knowledge that's inherently dangerous, not dangerous just because you think or feel it is. That might give someone the impression that if they disagree with your model, they can disregard the supposed sensitivity of this secret as well. You're not the sole arbiter of that; after all, reality is. And if their model of reality disagrees that this is sensitive information...?

In addition, it's not just your feelings at stake, but national policies and the fate of the world. Disregarding a "soft" secret like this may therefore seem worth it to some people. I'm pointing out that doing so would have bad objective effects as well, not just cause subjective hurt feelings.

I would agree-vote, if that were an option.

You can select specific lines of text that you agree with and use inline reacts to express your agreement!

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.

In an ideal world, it would be good to relax this clause from a binary to a spectrum. For example: if someone tells me of a hazard that I'm confident I would've discovered on my own a week later, then they only get to dictate my not-sharing-it for a week. "Knowing" isn't a strict binary; anyone can rederive anything given enough time (maybe), so it's really a question of how long it would've taken me to find it if they hadn't told me. This can even include someone bringing my attention to something I already knew, but which I wouldn't have thought to pay attention to as quickly if they hadn't pointed it out.
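
As a rough sketch (with made-up names, and glossing over how hard it is to estimate the rederivation time):

```python
# Rough sketch of the time-based relaxation. The hard part, of course,
# is estimating how long I would've taken to rederive the hazard alone.

def may_share_at(t: float, t_told: float, est_time_to_rederive_alone: float) -> bool:
    # The teller's model only binds me until I'd plausibly have found it myself.
    return t >= t_told + est_time_to_rederive_alone
```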

In the non-ideal world we inhabit, however, it's unclear how fraught it is to use such considerations.

This is a great point. I also notice that a decent number of people's risk models change frequently with the news, and that's not ideal either, as it makes them less likely to stick with a particular approach that depends on some risk model. In an ideal world we'd have enough people pursuing enough approaches under most possible risk models that it'd make little sense for anyone to consider switching. Maybe the best we can approximate now is to discuss this less.

“Don't share information that’s exfohazardous on others' models, even if you disagree with those models, except if your knowledge of it isn’t exclusively caused by other alignment researchers telling you of it.”

So if Alice tells me about her alignment research, and Bob thinks that Alice’s alignment research is exfohazardous, then I can’t tell people about Alice’s alignment research?

Unless I’ve misunderstood you, that’s a terrible policy.

Why am I deferring to Bob, who is completely unrelated? Why should I not use my best judgement, which includes the consideration that Bob is worried? And what does this look like in practice, given that some people think everything under the sun is exfohazardous?

Of course, if someone tells me some information and asks me not to share it then I won’t — but that’s not a special property of AI xrisk.

Pretty sure that's what the "telling you of it" part fixes. Alice is the person who told you of Alice's hazards, so your knowledge is exclusively caused by Alice, and Alice is the person whose model dictates whether you can share them.

Yep, if that's the OP's suggestion, then I endorse the policy. (But I think it'd be covered by the more general policy of "Don't share information someone tells you if they wouldn't want you to".) But my impression is that the OP is suggesting the stronger policy I described?

No, Tamsin's interpretation is correct.

Natural language is prone to ambiguous interpretations, and I'd tried to rephrase the summary a few times to avoid them. Didn't spot that one.

Okay, mea culpa. You can state the policy clearly like this:

"Suppose that, if you hadn't been told  by someone who thinks  is exfohazardous, then you wouldn't have known  before time . Then you are obligated to not tell anyone  before time ."