Yeah, I agree that it's easy to err in that direction, and I've sometimes done so. Going forward I'm trying to more consistently say the "obviously I wish people just wouldn't do this" part.
Though note that even claims like "unacceptable by any normal standards of risk management" feel off to me. We're talking about the future of humanity, there is no normal standard of risk management. This should feel as silly as the US or UK invoking "normal standards of risk management" in debates over whether to join WW2.
FWIW the comments feel fine to me, but I'm guessing that many of the downvotes are partisan.
FWIW I broadcast the former rather than the latter because from the 25% perspective there are many possible worlds which the "stop" coalition ends up making much worse, and therefore I can't honestly broadcast "this is ridiculous and should stop" without being more specific about what I'd want from the stop coalition.
A (loose) analogy: leftists in Iran who confidently argued "the Shah's regime is ridiculous and should stop". It turned out that there was so much variance in how it stopped that this argument wasn't actually a good one to confidently broadcast, despite in some sense being correct.
Ty for the comment. I stumbled upon the post, misread the dates, and had started working on a submission.
The submission form says that it's no longer accepting responses, FYI.
I suspect that many of the things you've said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.
I think that if it was going to go ahead, it should have been made stronger and clearer. But this wouldn't have been politically feasible, and therefore if that were the standard being aimed for, it wouldn't have gone ahead.
This, I think, would have been better than the outcome that actually happened.
Whoa, this seems very implausible to me. Speaking with the courage of one's convictions in situations which feel high-stakes is an extremely high bar, and I know of few people who I'd describe as consistently doing this.
If you don't know anyone who isn't in this category, consider whether your standards for this are far too low.
For a while now I've been thinking about the difference between "top-down" agents which pursue a single goal, and "bottom-up" agents which are built around compromises between many goals/subagents.
I've now decided that the frame of "centralized" vs "distributed" agents is a better way of capturing this idea, since there's no inherent "up" or "down" direction in coalitions. It's also more continuous.
Credit to @Scott Garrabrant, who made something like this point to me a while back, in a way which I didn't grok at the time.
This is a reasonable point, though I also think that there's something important about the ways that these three frames tie together. In general it seems to me that people underrate the extent to which there are deep and reasonably-coherent intuitions underlying right-wing thinking (in part because right-wing thinkers have been bad at articulating those intuitions). Framing the post this way helps direct people to look for them.
But I could also just say that in the text instead. So if I do another post like this in the future I'll try your approach and see if that goes better.