Even if we assume that's true (it seems reasonable, though less capable AIs might blunder on this point, whether by failing to understand the need to act nice, failing to understand how to act nice, or believing themselves to be in a winning position before they actually are), what does an AI need to do to get in a winning position? And how easy is it to make those moves without them being seen as hostile?
An unfriendly AI can sit on its server saying "I love mankind and want to serve it" all day long, and unless we have solid neural net interpretability or some future equivalent, we might never know it's lying. But not even a superintelligence can take over the world just by saying "I love mankind". It needs some kind of lever. Maybe it can flash its message of love at just the right frequency to hack human minds, or to invoke some sort of physical effect that lets it move matter. But whether it can or not depends on facts about physics and psychology, and if that's not an option, it doesn't become an option just because it's a superintelligence trying it.
What odds are you willing to give taking friction into account?
How much are you willing to bet? I’ll take you up on that up to fairly high stakes.
I agree that alien visits are fairly unlikely, but not 99% unlikely.
This seems untrue. For one thing, high-powered AI is in a lot more hands than nuclear weapons. For another, nukes are well-understood, and in a sense boring. They won’t provoke as strong a “burn it down for the lolz” response as AI will.
Even experts like Yann LeCun often don’t merely fail to understand the danger; they actively rationalize against understanding it. The risks are simply not understood or accepted outside of a very small number of people.
Remember the backlash around Sydney/Bing? Didn’t stop her creation. Also, the idea that governments are working in their nations’ interests does not survive looking at history, current policy or evolutionary psychology (think about what motivations will help a high-status tribesman pass on his genes. Ruling benevolently ain’t it.)
You think RLHF solves alignment? That’s an extremely interesting idea, but so far it looks like RLHF Goodharts the objective instead. If you have ideas about how to fix that, by all means share them, but there is as yet no theoretical reason to think it isn’t Goodharting, while the frequent occurrence of jailbreaks on ChatGPT would seem to bear this out.
Maybe. The point of intelligence is that we don’t know what a smarter agent can do! There are certainly limits to the power of intelligence; even an infinitely powerful chess AI can’t beat you in one move, nor in two unless you set yourself up for Fool’s Mate. But we don’t want to make too many assumptions about what a smarter mind can come up with.
AI-powered robots without super intelligence are a separate question. An interesting one, but not a threat in the same way as superhuman AI is.
Ever seen an inner city? People are absolutely shooting each other for the lolz! It’s not everyone, but it’s not that rare either. And if the contention is that many people getting strong AI results in one of them destroying the world just for the hell of it, inner cities suggest very strongly that someone will.
Exercise and stimulants tend to heighten positive emotions. They don’t generally heighten negative ones, but that’s probably all to the good, right? Increased social interaction, both in terms of time and in terms of emotional closeness, tends to heighten both positive and negative emotions.
Strongly upvoted. This is a very good point.
150,000 people die every day. That's not a small price for any delays to AGI development. Now, we need to do this right: AGI without alignment just kills everyone; it doesn't solve anything. But the faster we get aligned AI, the better. And trying to slow down capabilities research without much thought into the endgame seems remarkably callous.
Eliezer has mentioned the idea of trying to invent a new paradigm for AI, outside of the conventional neural net/backpropagation model. The context was more "what would you do with unlimited time and money" than "what do you intend irl", but this seems to be his ideal play. Now, I wish him the best of luck with the endeavor if he tries it, but do we have any evidence that another paradigm is possible?
Evolved minds use something remarkably close to the backprop model, and the only other model we've seen work is highly mechanistic AI like Deep Blue. The Deep Blue model doesn't generalize well, nor is it capable of much creativity. A priori, it seems somewhat unlikely that any other AI paradigm exists: why would math just happen to permit it? And if we oppose capabilities research until we find something like a new AI model, there's a good chance that we oppose it all the way to the singularity, rather than ever contributing to a Friendly system. That's not an outcome anyone wants, and it seems to be the default outcome of incautious pessimism.
This is excellently written, and the sort of thing a lot of people will benefit from hearing. Well done Zvi.
Well said! Though it raises a question: how can we tell when such defenses are serving truth vs defending an error?
As for an easier word for “memetic immune system”, Lewis might well have called it Convention, as convention is when we disregard memes outside our normal milieu. Can’t say for Chesterton or Aquinas; I’m fairly familiar with Lewis, but much less so with the others apart from some of their memes like Chesterton’s Fence.
Good analogy, but I think it breaks down. The politician’s syllogism, and the resulting policies, are bad because they tend to make the world worse. I would say that Richard’s comment is an improvement, even if you think it might be a suboptimal one, and that pushing back against improvements tends to result in fewer improvements. “Don’t let the perfect be the enemy of the good” is a saying for very good reason.
The syllogism here is more like:

1. Something beneficial ought to be done.
2. This is beneficial.
3. Therefore I probably ought not to oppose this, though if I see a better option I’ll do that instead of doubling down on this.