Irresponsible Companies Can Be Made of Responsible Employees
tl;dr:

1. In terms of the financial interests of an AI company, bankruptcy and the end of the world are equally bad. A company acting purely on its financial interests[1] would happily accept a significant risk of extinction in exchange for increased revenue.
2. There are plausible mechanisms that allow a company to act this way even if virtually every employee would prefer the opposite (for example, selectively hiring people with biased beliefs, or exploiting collective action problems).
3. In particular, you can hold that an AI company is completely untrustworthy even while believing that all of its employees are fine people.

Epistemic status & disclaimers: The mechanisms I describe definitely play some role in real AI companies. But in practice, more things are going on simultaneously, and this post is not trying to give a full picture.[2][3] Also, none of this is meant to be novel; I am just putting existing ideas together and applying them to AI risk.

From a financial point of view, bankruptcy is no worse than destroying the world

Let's leave aside the question of how real companies act. Instead, we start with a simple observation: if all a company cared about were its financial interests, bankruptcy and the destruction of the world would be equivalent.

Unsurprisingly, this translates into undesirable decisions in various situations. For example, consider an over-simplified scenario where an AI company somehow has precisely these two options[4]:

* Option A: 10% chance of destroying the world, 90% chance of nothing happening.
* Option B: Certainty of losing 20% market share.

We could imagine that this corresponds to racing ahead (which risks causing the end of the world) versus taking things slowly (which leads to a loss of revenue). But to simplify the discussion, we assume that Option A has no benefit and everybody knows this. In this situation, if the company followed its financial interests (and knew the numbers), it should take Option A -- deploy the risky model.
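To make the arithmetic concrete, here is a minimal sketch of the expected-value comparison, assuming a purely financial utility function under which extinction and bankruptcy both collapse to zero revenue. The baseline revenue figure and function names are illustrative placeholders, not data about any real company.

```python
# A minimal sketch: a purely "financial" utility function cannot
# distinguish extinction from bankruptcy -- both mean zero revenue.
# All numbers are illustrative, matching the toy scenario above.

BASELINE_REVENUE = 100.0  # arbitrary units

def financial_utility(revenue: float) -> float:
    # A purely financial actor only sees revenue; the world ending
    # and going bankrupt both map to the same outcome: 0.
    return revenue

# Option A: 10% chance the world is destroyed (revenue 0, same as
# bankruptcy), 90% chance nothing happens (revenue unchanged).
ev_option_a = 0.10 * financial_utility(0.0) + 0.90 * financial_utility(BASELINE_REVENUE)

# Option B: certainty of losing 20% market share.
ev_option_b = financial_utility(0.80 * BASELINE_REVENUE)

print(f"Expected value of Option A: {ev_option_a}")  # 90.0
print(f"Expected value of Option B: {ev_option_b}")  # 80.0
# A purely financial decision-maker prefers Option A: it accepts a 10%
# chance of destroying the world over a certain 20% revenue loss.
```

The specific numbers don't matter much: as long as extinction is valued at "merely" zero, any sufficiently large guaranteed revenue loss looks worse than a sizable extinction risk.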