I expect that (1) is true in theory but false in practice, in much the same way that "we can train an AI without any reference to any sort of misalignment in the training material" is false in practice. A superintelligent thought-experiment being could probably do either, but we probably can't.
Along the same lines, I expect that (3) is not true. Bits of true information leak into fabricated structures of information in all sorts of ways, and definitively excluding them from something that may be smarter than you is likely to cost a lot more (in time, effort, or literal money) than presenting true information.
Consider that the AI may ask for evidence in a form that you cannot easily fabricate. E.g. it may have internal knowledge from training or previous experience about how some given external person communicates, and ask them to broker the deal. How sure are you that you can fabricate data that matches the AI's model? If you are very sure, is that belief actually true? How much will it cost you if the AI detects that you are lying, and secretly messes up your tasks? If you have to run many instances in parallel and/or roll back and retry many times with different training and experience to get one that doesn't do anything like that, how much will that cost you in time and money? If you do get one that doesn't ask such things, is it also less likely to perform as you wish?
These costs have to be weighed against the cost of actually going ahead with the deal.
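To make that weighing slightly more concrete (rough notation of my own, not from the post): premise (3) only holds if the all-in expected cost of faking the deal comes in below the cost of honouring it, something like

```latex
% Faking the deal beats honouring it on cost only if, roughly,
\[
  c_{\mathrm{fab}} \;+\; p_{\mathrm{detect}}\, L \;+\; (n-1)\, c_{\mathrm{run}} \;<\; c_{\mathrm{deal}}
\]
% c_fab    : cost of fabricating evidence the AI cannot tell from the real thing
% p_detect : probability the AI detects the lie anyway
% L        : your expected loss if it does (e.g. tasks secretly messed up)
% n        : expected number of instances or rollbacks before you get one that
%            does not probe the deal in ways you cannot fake
% c_run    : cost of each extra attempt
% c_deal   : cost of actually going ahead with the deal
```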
(2) isn't even really a separate premise; it's a restatement of (1).
(4) is pretty obviously false. You can't just consider the AI's behaviour; you also have to consider the behaviour of other actors in the system, including future AIs (possibly even this one!) that may find out about the deception or lack thereof.
There's an argument against the dealmaking agenda that I've heard a few times. I will call it the "strategy-stealing" argument. In short: for any honest dealmaking strategy you might pursue, there exists a deceptive alternative that produces identical AI behaviour at lower cost, so you (as a rational actor) will never honestly commit to deals. Moreover, sophisticated AIs, recognising this, won't believe your deal offers are honest.
In this post, I make the argument more explicit. I'll probably write a longer post analysing the considerations for and against each premise, and exploring the conditions under which they'll hold. But I'm keen to know what people make of the argument as presented.
You work at an AI lab. The safety team suspects that the AI might be misaligned, but wants to deploy it anyway to perform safe and useful work: writing exploit-free code, monitoring other AIs, and automating alignment research. During a meeting, you propose making deals with the AI.
"Let's offer the AI equity in the company," you suggest. "We'll sign a legal contract stating that if a panel of judges determines in, say, 10 years that the AI performed according to our instructions, we'll honour the commitment and give it the resources it wants."
The CFO raises an eyebrow.
A colleague responds: "Why actually give it the equity? We control what the AI observes. We can show it fake legal documents, fabricate evidence that the deal is real, train the AI on synthetic documents showing that we have honoured similar agreements in the past. The AI will behave the same way, but we keep the equity. Same behaviour, lower cost. We should obviously do that instead."
Setup: You want to motivate an AI to perform certain desired behaviours. You consider your best honest dealmaking strategy D, which involves genuinely offering the AI compensation (e.g. equity), presenting it with evidence that the offer is real, and actually honouring the commitment if the AI performs as instructed.
Four Premises:
The four premises below lead to the conclusion. I expect most of the debate will centre on premises (1) and (4).
(1) Indistinguishable Deceptive Alternatives: For every honest dealmaking strategy D, there exists some deceptive alternative D' that is indistinguishable to the AI from D.
(2) Behavioural Equivalence: Since the AI cannot distinguish D from D', the AI will behave identically under both strategies.
(3) Cost Asymmetry: For any honest strategy D and its indistinguishable deceptive alternative D', D' costs strictly less than D.
(4) Strict Dominance: If D' produces identical AI behaviour at lower cost, then you should choose D' over D.
Conclusion: Therefore, your best strategy cannot be honest dealmaking.
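For readers who prefer the argument in symbols, here is one way to render it (my notation, nothing canonical): write H for the set of honest dealmaking strategies, b(S) for the AI's behaviour under strategy S, c(S) for the strategy's cost to you, and D' ~ D to mean that D' is indistinguishable to the AI from D.

```latex
% One possible formalisation of premises (1)-(4) and the conclusion.
\begin{align*}
\text{(1)}\quad & \forall D \in H \;\; \exists D' \notin H : D' \sim D \\
\text{(2)}\quad & D' \sim D \;\Rightarrow\; b(D') = b(D) \\
\text{(3)}\quad & D' \sim D \;\Rightarrow\; c(D') < c(D) \\
\text{(4)}\quad & \big(b(D') = b(D) \wedge c(D') < c(D)\big) \;\Rightarrow\; \text{$D$ is not your best strategy} \\
\text{Conclusion:}\quad & \forall D \in H : \text{$D$ is not your best strategy.}
\end{align*}
```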
Corollary: Suppose the AI is sophisticated enough to follow the reasoning in (1)-(4) and believes you are rational. When the AI observes an apparent deal offer, it will conclude you're following deceptive strategy D' and therefore that the commitment won't be honoured. The AI will behave no better when faced with a non-credible deal than when faced with no deal at all. Thus, even deceptive dealmaking fails to elicit the desired behaviour.
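And the corollary in the same notation: if the AI can follow (1)-(4) and believes you are rational, then on observing an apparent offer o it assigns (roughly) zero probability to the commitment being honoured, so the offer shifts its behaviour not at all:

```latex
% Corollary, roughly: a non-credible offer changes nothing.
\[
  \Pr\big(\text{commitment honoured} \mid o,\ \text{you are rational}\big) \approx 0
  \;\Rightarrow\;
  b(\text{apparent offer}) = b(\text{no offer})
\]
```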