Thanks! I mentioned Anthropic in the post, but would similarly find it interesting if someone did a write-up about Cohere. It could be that OAI is not representative for reasons I don't understand.
Instagram is the closest I can think of, but that was ~20x smaller and an acquisition, not an investment.
I tried playing the game Nate suggested with myself. I think it updated me a bit more towards Holden's view, though I'm very confident that if I did it with someone more expert than I am, both the attacker and the defender would be more competent, and possibly the attacker would win.
Attacker: okay, let's start with a classic: Alignment strategy of "kill all the capabilities researchers so networks are easier to interpret."
Defender: Arguments that this is a bad idea will obviously be in any training set that a human level AI researcher would be trained on. E.g. posts from this site.
Attacker: sure. But those arguments don't address what an AI would be considering after many cycles of reflection. For example: it might observe that Alice endorses things like war where people are killed "for the greater good", and a consistent extrapolation from these beliefs is that murder is acceptable.
Defender: still pretty sure that the training corpus would include stuff about the state having a monopoly on violence, and any faithful attempt to replicate Alice would just clearly not have her murdering people? Like a next-token prediction that has her writing a serial killer manifesto would get a ton of loss.
Attacker: it's true that you probably wouldn't predict that Alice would actively murder people, but you would predict that she would be okay with allowing people to die through her own inaction (standard "child drowning in a pond" thought experiment stuff). And just like genocides are bureaucratized such that no single individual feels responsible, the AI might come up with some system which doesn't actually leave Alice feeling responsible for the capabilities researchers dying.
(Meta point: when does something stop being POUDA? Like what if Alice's CEV actually is to do something wild (in the opinion of current-Alice)? I think for the sake of this exercise we should not assume that Alice actually would want to do something wild if she knew/reflected more, but this might be ignoring an important threat vector?)
Defender: I'm not sure exactly what this would look like, but I'm imagining something like "build a biological science company that has an opaque bureaucracy such that each person pursuing the good will somehow result in the collective creating a bioweapon that kills capabilities researchers" and this just seems really outside what you would expect Alice to do? I concede that there might not be anything in the training set which specifically prohibits this per se, but it just seems like a wild departure from Alice's usual work of interpreting neurons.
(Meta: is this Nate's point? Iterating reflection will inevitably take us wildly outside the training distribution so our models of what an AI attempting to replicate Alice would do are wildly off? Or is this a point for Holden: the only way we can get POUDA is by doing something that seems really implausible?)
Yeah, that's correct on both counts (that does seem like an important distinction, and neither really matches my experience, though the former is more similar).
I spent about a decade at a company that grew from 3,000 to 10,000 people; I would guess the layers of management were roughly the logarithm in base 7 of the number of people. Manager selection was honestly kind of a disorganized process, but it was basically: impress your direct manager enough that they suggest you for management, then impress your division manager enough that they sign off on this suggestion.
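As a rough sanity check on the log-base-7 guess, here is a minimal sketch. It assumes a uniform span of control of about seven reports per manager, which is a simplification and not something measured at the company; the function name `management_layers` is made up for illustration.

```python
import math

def management_layers(headcount: int, span: float = 7.0) -> float:
    """Estimate management layers as log base `span` of headcount.

    Assumes every manager has roughly `span` direct reports (hypothetical).
    """
    return math.log(headcount, span)

# Headcounts at the start and end of the period described above.
for headcount in (3_000, 10_000):
    print(headcount, round(management_layers(headcount), 2))
```

Both ends of that growth land between four and five layers, since 7^4 = 2,401 and 7^5 = 16,807.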
I'm currently somewhere much smaller; I report to the top layer and have two layers below me. Process is roughly the same.
I realized that I should have said that I found your Spotify example the most compelling: the problems I see/saw are less "manager screws over the business to personally advance" and more "helping the business would require manager to take a personal hit, and they didn't want to do that."
For what it's worth, I think a naïve reading of this post would imply that moral mazes are more common than my experience indicates.
I've been in middle management at a few places, and in general people just do reasonable things because they are reasonable people, and they aren't ruthlessly optimizing enough to be super political even if that's the theoretical equilibrium of the game they are playing.[1]
This obviously doesn't mean that they are ruthlessly optimizing for the company's true goals though. They are just kind of casually doing things they think are good for the business, because playing politics is too much work.
FYI I think your first skepticism was mentioned in the safety from speed section; she concludes that section:
These [objections] all seem plausible. But also plausibly wrong. I don’t know of a decisive analysis of any of these considerations, and am not going to do one here. My impression is that they could basically all go either way.
She mentions your second skepticism near the top, but I don't see anywhere she directly addresses it.
think about how humans most often deceive other humans: we do it mainly by deceiving ourselves... when that sort of deception happens, I wouldn't necessarily expect to be able to see deception in an AI's internal thoughts
The fact that humans will give different predictions when forced to make an explicit bet versus just casually talking seems to imply that it's theoretically possible to identify deception, even in cases of self-deception.
Thanks for the questions!