I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' were ever an appropriate term.
The analogy between AI safety and math or physics is assumed in a lot of your writing, and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.
I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in these worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t be found via iteration at the end. In those worlds, we are probably doomed, so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs that you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.
+1. As a toy model, consider how the expected maximum of a sample from a heavy-tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares’ point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.
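For anyone who wants to poke at the toy model, here is a minimal sketch. The comment above doesn't say which distribution was simulated, so the Pareto distribution, the shape parameter `alpha`, and the function names are all assumptions made for illustration. For a Pareto with minimum 1 and shape `alpha > 1`, the expected maximum of `n` draws has a closed form that grows roughly like `n ** (1 / alpha)`, so the growth looks close to linear only when `alpha` is near 1.

```python
import math
import random


def simulated_expected_max(n, alpha, trials=2000, seed=0):
    """Monte Carlo estimate of E[max of n Pareto(alpha) draws].

    Assumption: Pareto with minimum value 1, sampled via the standard
    library's random.paretovariate. The original comment does not name
    a distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.paretovariate(alpha) for _ in range(n))
    return total / trials


def exact_expected_max(n, alpha):
    """Closed form for the same quantity, valid for alpha > 1:

    E[max] = Gamma(n + 1) * Gamma(1 - 1/alpha) / Gamma(n + 1 - 1/alpha)
    """
    if alpha <= 1:
        raise ValueError("the expected maximum is infinite for alpha <= 1")
    log_value = (
        math.lgamma(n + 1)
        + math.lgamma(1 - 1 / alpha)
        - math.lgamma(n + 1 - 1 / alpha)
    )
    return math.exp(log_value)


if __name__ == "__main__":
    # Compare growth for a tail close to alpha = 1 and a lighter tail.
    for alpha in (1.1, 2.0):
        print(f"alpha = {alpha}")
        for n in (10, 100, 1000):
            print(f"  n = {n:5d}  E[max] = {exact_expected_max(n, alpha):.1f}")
```

The simulation is noisy for `alpha` close to 1 because the maximum then has infinite variance, which is why the closed form is used for the printed comparison. Neither function models the serial-time constraint; they only illustrate how the payoff from parallel bets scales with the number of bets.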
At its core, the argument appears to be "reward-maximizing consequentialists will necessarily get the most reward." Here's a counterexample to this claim: if you train a Go-playing AI with RL, you are unlikely to get a reward-maximizing consequentialist. Why? There's no reason for the Go-playing AI to think about how to take over the world or hack the computer that is running the game. Thinking this way would be a waste of computation. AIs that think about how to win within the boundaries of the rules therefore do better.
In the same way, if you could robustly enforce rules like "turn off when the humans tell you to" or "change your goal when humans tell you to" etc, perhaps you end up with agents that follow these rules rather than agents that think "hmmm... can I get away with being disobedient?"
Both achieve the same reward if the rules are consistently enforced during training, and I think there are weak reasons to expect deontological agents to be more likely.
I don't mean to distract from your overall point, though, which I take to be "a philosopher said a smart thing about AI alignment despite not having much exposure." That's useful data.
I don't know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it's confused about your preferences. To the extent to which you can easily identify when a system has this behavior and when it doesn't, you can use RLHF to produce this behavior just like you can use RLHF to produce many other desired behaviors.
Thoughts on John's comment: this is a problem with any method for detecting deception that isn't 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.
Also, you can somewhat get around this by holding some deception-detecting methods out (i.e. not optimizing against them). When you finish training and the held-out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than fool your held-out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction.
Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.
In general, it seems better to me to evaluate research by asking "where is this taking the field/what follow-up research is this motivating?" rather than "how are the words in this paper directly useful if we had to build AGI right now?" Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I'm pretty skeptical of a lot of the direct value of empirical research.
In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology -- i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this.
I've written a bit about the motivations for AI developmental psychology here.