> My understanding is that Paul thinks breaking the evolution analogy is important, but a lot less difficult than Eliezer thinks it is
My basic take on the evolution analogy:
* Evolution wasn't trying to solve the robustness problem at all. It's analogous to using existing ML while making _zero ...
> googling around, I wasn't able to quickly find any papers or blog posts supporting it
I think it's a little bit tricky because decision trees don't work that well for the tasks where people usually study adversarial examples. And this isn't my research area so I don't know much about it.
That sa...
> I guess my point is that there are open questions about how to protect against value drift caused by AI, what the AI should do when the user doesn't have much idea of how they want their values to be pushed around, and how to get the AI to competently help the user with moral questions, which seem...
> Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples?
Yes. (Answer: deep learning is not unusually susceptible to adversarial examples.)
> 5 minutes of research is enough to determine that creating models which "correctly c...
> Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc.
For most of these examples, the current research in safety is more like "Try to find _any_ app...
> don't understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is:
* Your AI samples two answers to a question.
* The human evaluates which one of them is better. The AI's objective is to sample answers that are most likely to be marked as...
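The two steps above can be sketched as a toy training loop. Everything here is illustrative rather than from the comment: the canned answer set, the quality scores, the REINFORCE-style logit update, and a scripted preference oracle standing in for the human judge.

```python
import math
import random

# Illustrative answer pool and hidden quality scores (assumptions, not
# part of the original setup).
ANSWERS = ["short answer", "detailed answer", "evasive answer"]
TRUE_QUALITY = {"short answer": 1.0, "detailed answer": 2.0, "evasive answer": 0.0}

# The "AI" is a categorical policy: a softmax over per-answer logits.
logits = {a: 0.0 for a in ANSWERS}

def sample():
    """Sample one answer from the softmax policy."""
    weights = [math.exp(logits[a]) for a in ANSWERS]
    r = random.random() * sum(weights)
    for a, w in zip(ANSWERS, weights):
        r -= w
        if r <= 0:
            return a
    return ANSWERS[-1]

def judge(a, b):
    """Stand-in for the human evaluation step: pick the better answer."""
    return a if TRUE_QUALITY[a] >= TRUE_QUALITY[b] else b

random.seed(0)
LR = 0.5
for _ in range(2000):
    a, b = sample(), sample()      # the AI samples two answers
    if a == b:
        continue
    winner = judge(a, b)           # the human marks one as better
    loser = b if winner == a else a
    logits[winner] += LR           # reinforce the preferred sample
    logits[loser] -= LR

best = max(logits, key=logits.get)
```

Under this toy setup the policy concentrates on whichever answer the judge consistently prefers, which is the sense in which the objective is "sample answers most likely to be marked as better."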
> In the reverse direction amplification mostly seems less adversarial since it's pure supervised learning
Note that you could do amplification with supervised learning, imitation, or RL as the distillation step; in the long run I imagine using imitation+RL, which brings it closer to debate...
I don't see why to separate 1/2; the goal is to find training data that describes some "universal" core for behavior.
3. I don't think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent,...
I don't know what the statement of the theorem would be. I don't really think we'd have a clean definition of "contains daemons" and then have a proof that a particular circuit doesn't contain daemons.
Also I expect we're going to have to make some assumption that the problem is "generic" (or else ...
Suppose "predict well" means "guess the output with sufficiently high probability," and the noise is just to replace the output with something random 5% of the time.
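One consequence of this noise model is worth making concrete: even a perfect predictor can only match the observed output with probability 1 − ε + ε/K, since the output is replaced by a random value ε = 5% of the time. The 5% rate is from the comment; the 10-way output space and uniform corruption are assumptions for illustration.

```python
import random

random.seed(1)
K = 10          # assumed size of the output space
NOISE = 0.05    # 5% of outputs replaced with something random (from the comment)
TRIALS = 200_000

hits = 0
for _ in range(TRIALS):
    true_label = random.randrange(K)
    # Corrupt the observed output 5% of the time with a uniform draw.
    observed = true_label if random.random() > NOISE else random.randrange(K)
    # A perfect predictor always guesses the true label.
    hits += (observed == true_label)

expected = (1 - NOISE) + NOISE / K   # 0.955 under these assumptions
empirical = hits / TRIALS
```

So "sufficiently high probability" has to be read relative to this ceiling (0.955 here), not relative to 1.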