Responding to the Facebook thread about possible attacks, here is a simple attack that seems worth analyzing/defending against: Set up a number of sockpuppet accounts. Find (or post) a set of comments that you predict the moderator will hide. Use your sockpuppet accounts to upvote/downvote those comments in a fixed pattern (e.g., account 1 always upvotes, account 2 always downvotes, and so on), so as to cause the ML algorithm to associate that pattern of votes with the "hide" moderator action. When you want to cause a comment that you don't like to be hidden, apply the same pattern of votes to it.
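To make the attack concrete, here is a minimal sketch of it against a toy learner. Everything in it is an assumption for illustration: the predictor is a plain online logistic regression over per-account votes (+1 upvote, -1 downvote, 0 no vote), which is not necessarily the algorithm under discussion, and the names `PATTERN`, `N_HONEST`, `train`, and `predict` are hypothetical.

```python
import math
import random

# Assumption: the "hide" predictor is logistic regression over one feature per
# account (+1 upvote, -1 downvote, 0 no vote). This is a stand-in learner only.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, votes):
    """Predicted probability that the moderator hides the comment."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, votes)))

def train(examples, n_features, epochs=50, lr=0.1):
    """Online gradient descent on log loss over (votes, hidden) pairs."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for votes, hidden in examples:
            err = predict(weights, bias, votes) - hidden
            bias -= lr * err
            for i, v in enumerate(votes):
                weights[i] -= lr * err * v
    return weights, bias

N_HONEST = 5                 # hypothetical number of honest accounts
PATTERN = [1, -1, 1, -1]     # fixed votes cast by the four sockpuppets
rng = random.Random(0)

def honest_votes():
    return [rng.choice([-1, 0, 1]) for _ in range(N_HONEST)]

# Step 1: bait comments the moderator hides, each carrying the fixed pattern.
bait = [(honest_votes() + PATTERN, 1) for _ in range(30)]
# Ordinary comments: the sockpuppets abstain and the moderator leaves them up.
ordinary = [(honest_votes() + [0] * len(PATTERN), 0) for _ in range(30)]

examples = bait + ordinary
rng.shuffle(examples)
weights, bias = train(examples, N_HONEST + len(PATTERN))

# Step 2: apply the same pattern to a comment the attacker wants hidden.
target = [1, 1, 1, 0, 1]     # honest votes on a well-liked comment
p_without = predict(weights, bias, target + [0] * len(PATTERN))
p_with = predict(weights, bias, target + PATTERN)
print(f"P(hide) without pattern: {p_without:.3f}")
print(f"P(hide) with pattern:    {p_with:.3f}")
```

In this toy setup the sockpuppet features are only ever updated on hidden bait comments, so their weights move toward the pattern and applying it to the target always raises the predicted hide probability. How large that increase is depends on the learner and the data, so the numbers printed here say nothing about a real system.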
Thanks for pointing out this attack. The regret bound implies that adversarially controlled features can't hurt predictions much on average, but adversarially controlled content can also make the prediction problem harder (by forcing us to make more non-obvious predictions).
Note that in terms of total loss this is probably better than the situation where someone just makes a bunch of spammy posts without bothering with the upvote-downvote pattern: