TrojanNet: Embedding Hidden Trojan Horse Models in Neural Networks

TL;DR: 

TrojanNet embeds one neural network inside another, and the hidden network is NP-hard to detect, which limits transparency.

Abstract:

The complexity of large-scale neural networks can lead to poor understanding of their internal details. We show that this opaqueness provides an opportunity for adversaries to embed unintended functionalities into the network in the form of Trojan horses. Our novel framework hides the existence of a Trojan network with arbitrary desired functionality within a benign transport network. We prove theoretically that the Trojan network’s detection is computationally infeasible and demonstrate empirically that the transport network does not compromise its disguise. Our paper exposes an important, previously unknown loophole that could potentially undermine the security and trustworthiness of machine learning.
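
For intuition, here is a minimal sketch of the core mechanism as I understand it from the paper: the Trojan network shares the transport network's parameters through a pseudorandom permutation derived from a secret key, and both tasks are trained jointly on the shared weights. The function name, shapes, and key value below are my own illustration, not code from the paper.

```python
import numpy as np

def derive_hidden_weights(w_public: np.ndarray, key: int) -> np.ndarray:
    """Derive the hidden (Trojan) network's weights by applying a
    key-seeded pseudorandom permutation to the transport network's
    flattened parameters. An auditor without the key sees only w_public."""
    rng = np.random.default_rng(key)        # the secret key seeds the PRNG
    perm = rng.permutation(w_public.size)   # one permutation per key
    return w_public.ravel()[perm].reshape(w_public.shape)

# Hypothetical usage: the same tensor parameterizes both networks, so
# training the hidden task updates w_public through the permuted view.
w_public = np.random.randn(128, 128).astype(np.float32)
w_hidden = derive_hidden_weights(w_public, key=0x5EC2E7)
```

The hardness claim hinges on the key: an auditor who suspects a hidden network would have to search an astronomically large space of permutations, which the paper argues is computationally infeasible.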

(found on FB and apparently suggested by old-timer XiXiDu)

Comments:

I worry a little bit about this, i.e. techniques that let you hide circuits in neural networks.  These "hiding techniques" are a riposte to techniques based on modularity or clusterability, which look for naturally emergent patterns.[1]  In a world where our alignment techniques rely on internal circuitry being naturally modular, trojan horse networks can evade them.
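
To make the evasion concrete, here is a rough sketch, loosely in the spirit of modularity/clusterability metrics; the specific score and all names are my own illustration, not any published tool. It scores a layer's weight matrix by the Fiedler value of its weight graph, then shows a key-style permutation erasing the visible modular structure.

```python
import numpy as np

def fiedler_value(W: np.ndarray) -> float:
    """Treat |W| as a weighted bipartite graph between two layers and
    return the second-smallest eigenvalue of its normalized Laplacian;
    values near 0 mean the graph nearly splits into separate modules."""
    m, n = W.shape
    A = np.block([[np.zeros((m, m)), np.abs(W)],
                  [np.abs(W).T, np.zeros((n, n))]])
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    lap = np.eye(m + n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return float(np.linalg.eigvalsh(lap)[1])

rng = np.random.default_rng(0)
W = 0.01 * rng.random((8, 8))   # weak background connectivity
W[:4, :4] += 1.0                # strong intra-module block 1
W[4:, 4:] += 1.0                # strong intra-module block 2
print(fiedler_value(W))         # small: a cheap cut separates the two modules

perm = rng.permutation(W.size)  # a key-style permutation, as in TrojanNet
W_scrambled = W.ravel()[perm].reshape(W.shape)
print(fiedler_value(W_scrambled))  # larger: no modular split is visible
```

The point is just that a detector keyed to naturally emergent structure sees nothing once that structure is deliberately scrambled; recovering it means searching over permutations.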

I expect this to happen by default for a bunch of reasons.  An easy one to point to is the "free software" + "crypto anarchist" + "fuck your oversight" + "digital privacy" cluster, which will probably argue that the government shouldn't infringe your right to build and run neural networks that are not aligned.  Similar to how encrypting personal emails subverts attempts to limit the harms of email, "encrypting" neural network functions can subvert these alignment techniques.

To me, it seems like this ends up as a steganography-style equilibrium: people find new hiding techniques, and others find new techniques for discovering hidden information.  As long as humans are on both sides of this, I expect progress on both sides.  In the human-vs-human version, I think it's not too unevenly matched.

In cases where it's human-vs-AI, I strongly expect the AI to win in the limit.  This is in part why I'm optimistic about things like ELK solutions, which might be better at finding "hidden" modules, not just naturally occurring ones.

 

  1. ^

    The more time I spend with these, the more I think the idea of naturally occurring modularity or clusterability makes sense / seems likely, which has been a positive update for me.