I’m confused about the backdoor attack detection task even after reading it a few times:
The article says: “The key difference in the attack detection task is that you are given the backdoor input along with the backdoored model, and merely need to recognize the input as an attack”.
When I read that, I find myself wondering why that isn’t trivially solved by a model that memorises which input(s) are known to be an attack.
My best interpretation is that there are a bunch of possible inputs that cause an attack and you are given one of them and just have to recognise that one plus the others you don’t see. Is this interpretation correct?
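To make the triviality worry concrete, here is a minimal sketch of what I mean by a memorising detector (purely illustrative; the input strings and names are made up, not from any actual benchmark):

```python
# Hypothetical sketch: if the task were only to flag the exact input(s)
# already revealed as attacks, a lookup table would solve it trivially,
# with no inspection of the backdoored model at all.

# The backdoor input(s) we are handed as part of the task (made-up example).
known_attack_inputs = {"trigger phrase xyz"}

def detect_attack(model_input: str) -> bool:
    # Pure memorisation: flag an input iff it is one we were shown.
    return model_input in known_attack_inputs
```

Under my interpretation above, this fails precisely because there are other triggering inputs the detector was never shown, which is what makes the task non-trivial.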
This is a good post, so I’d definitely encourage you to write up a few more posts.
I know very little about you, so it’d be hard for me to make good suggestions, but here are two possibilities for your consideration:
Excellent post. One part I disagree with though:
“If you know anybody in politics anywhere, it might be a good idea to try and convince them to pay attention to this AGI thing” - It wouldn’t surprise me if this was net-negative and the default outcome of informing actors about AGI were for them to attempt to accelerate it.
Another part I’d disagree with is lionising technical researchers over everyone else.
What does the word swap add? Isn't the human just going to swap the words back as part of the reconstruction? Or are you betting on the rare cases where words can be written in either order, i.e. "black round" instead of "round black"?
It would also be nice to have a better idea of how the humans are supposed to be rewriting the plan. I suspect the best way to do this would be to provide an actual example of a plan being reconstructed by a human. One particular aspect I would like clarity on: how do you see the details of a plan being changed when the worry is that the plan is subtly off? Do you see this as occurring accidentally during reconstruction or is the idea that humans should intentionally change the details?
What is the Rose Garden Inn used for these days?