Censoring out-of-domain representations

by orthonormal

1st Feb 2017

AI Alignment Forum

5 comments

Stuart_Armstrong (9y)

Seems interesting, but the adversary seems to need a very specific definition of what's outside the domain. Absent that, this just becomes a patch or a nearest unblocked strategy: the solution will be the one that's best in the domain and doesn't trigger the specific outside-domain adversary.

IAFF-User-111 (9y)

I agree... if there are specific things you don't want the agent to be able to do or predict, then you can do something very similar to the cited "Censoring Representations" paper.

But if you want to censor all "out-of-domain" knowledge, I don't see a good way of doing it.

orthonormal (9y)

Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.

(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)

A whitelisting variant would be way more reliable than a blacklisting one, clearly.

William_S (8y)

A way to achieve whitelisting might be:

Establish a censorship scheme that could reliably censor all knowledge that the agent has. This might be somewhat tricky; possible approaches:

If the agent has a fixed input channel, censor the agent from reliably predicting anything about the state of the input channel, present, past, or future.
If the agent has a fixed output channel, censor the output of the agent from being distinguishable from a set of randomly generated bits (the censor network is a discriminator that tries to tell the difference between the two, and propagates the censor gradient to the agent).

But censor the censor network itself, so that its output contains no information relevant to the whitelisted domain.
The agent should then not be censored from knowing about anything related to the whitelisted domain.

This will run into issues about the scope implied by the whitelisted domain data set (a given dataset might imply too small or too large a relevant domain, and this might be tricky to know in advance).
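
To make the output-channel idea concrete, here is a minimal sketch under strong simplifying assumptions. It is not the scheme from the post: the agent is reduced to a vector of independent per-bit probabilities `p`, and the trained censor network is swapped for an idealized Bayes-optimal discriminator, so the censor gradient has a closed form (the gradient of the KL divergence from the agent's output distribution to uniform random bits). All function names are hypothetical. The whitelisting step, censoring the censor itself, is not modeled here.

```python
import math

def optimal_discriminator(p, bits):
    """Probability that `bits` came from the agent rather than from uniform
    random bits, for an idealized Bayes-optimal discriminator (equal priors).
    Assumption: the agent emits independent bits, bit i being 1 with prob p[i]."""
    agent_lik = 1.0
    for p_i, b in zip(p, bits):
        agent_lik *= (p_i if b else 1.0 - p_i)
    random_lik = 0.5 ** len(bits)
    return agent_lik / (agent_lik + random_lik)

def censor_penalty(p):
    """KL divergence (in nats) from the agent's output distribution to
    uniform random bits; zero exactly when every p[i] is 0.5."""
    total = 0.0
    for p_i in p:
        for q in (p_i, 1.0 - p_i):
            if q > 0.0:
                total += q * math.log(2.0 * q)
    return total

def censor_gradient(p):
    """Gradient of censor_penalty with respect to each p[i] (needs 0 < p[i] < 1)."""
    return [math.log(p_i / (1.0 - p_i)) for p_i in p]

def censor_step(p, lr=0.1, eps=1e-6):
    """One gradient step on the agent's output probabilities that makes the
    output harder to tell apart from random bits."""
    grads = censor_gradient(p)
    return [min(max(p_i - lr * g, eps), 1.0 - eps) for p_i, g in zip(p, grads)]

if __name__ == "__main__":
    p = [0.9, 0.2, 0.5]  # hypothetical agent output probabilities
    for step in range(50):
        p = censor_step(p)
    print(p, censor_penalty(p))
```

With only this penalty applied, the agent's output is pushed toward pure noise, which is the "censor everything" starting point; the proposal above would then relax the censor on whatever is relevant to the whitelisted domain.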

William_S (9y)

One example that could be tested is a translation system that translates both ways between language pairs (A,B) and (B,C), and that by default allows for zero-shot translation between A and C (as in https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html). Then apply this method. One way would be to inhibit learning translation between A and C (you could try inhibiting only A->C to see if it prevents C->A, or introduce additional language pairs into the setup).
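
As a rough sketch of how the conditions for such a test could be enumerated (the function names and the dictionary layout are hypothetical, and no translation model is involved), the following lists the translation directions that were never trained directly and builds one experimental condition per censored direction, plus a baseline and an all-censored condition. Every condition evaluates all zero-shot directions, so censoring only A->C still reports what happens to C->A.

```python
from itertools import permutations

def zero_shot_directions(languages, trained_pairs):
    """Directed (source, target) pairs with no direct training data.
    Each trained pair is assumed to be trained in both directions."""
    trained = set()
    for a, b in trained_pairs:
        trained.update({(a, b), (b, a)})
    return [d for d in permutations(languages, 2) if d not in trained]

def experiment_conditions(languages, trained_pairs):
    """Baseline, one condition per censored zero-shot direction, and a
    condition that censors all of them at once."""
    targets = zero_shot_directions(languages, trained_pairs)
    conditions = [{"name": "baseline", "censored": [], "evaluate": targets}]
    for src, tgt in targets:
        conditions.append({
            "name": f"censor {src}->{tgt}",
            "censored": [(src, tgt)],
            "evaluate": targets,
        })
    if len(targets) > 1:
        conditions.append({
            "name": "censor all zero-shot",
            "censored": targets,
            "evaluate": targets,
        })
    return conditions

if __name__ == "__main__":
    for cond in experiment_conditions(["A", "B", "C"], [("A", "B"), ("B", "C")]):
        print(cond["name"], cond["censored"])
```

Adding further languages or trained pairs to the arguments extends the same enumeration to the larger setups mentioned above.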
