(Quick Thought)
Perhaps the goal for existing work targeting AI safety is less to ensure that AI safety happens, and more to make sure that we make AI systems that are strictly[1] better than the current researchers at figuring out what to do about AI safety.
I'm unsure how hard AI safety is. But I consider it fairly likely that mid-term (maybe 50% of the way to TAI, in years) safe AI systems are likely to outperform humans on AI safety strategy and the large majority of the research work.
If humans can successfully bootstrap more capable infrastructure than us, then our (humans) main work is done (though there could still be other work we can help with).
It might well be the case that the resulting AI systems would recognize that the situation is fairly hopeless. But at that point, humans have done they key things they need to do on this, hopeless or not. Our job is to set things up as best we can, more is by definition impossible.
Personally, I feel very doomy about humans now solving for various alignment problems of many years from now. But I feel much better about us making systems that will do a better job at guiding things then we could.
(The empirical question here is how difficult it is to automate alignment research. I realize this is a controversial and discussed topic. My guess is that many researchers will never agree with good AI systems, and always hold out on considering them superior - and that on the flip side, many people will trust AIs before they really should. Getting this right is definitely tricky.)
[1] Strictly meaning that they're very likely better overall, not that there's absolutely no area humans will be better than them.
Thanks for the clarification.
> But the thing I'm most worried about is companies succeeding at "making solid services/products that work with high reliability" without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
The way I see it, "making solid services/products that work with high reliability" is solving a lot of the alignment problem. As in, this can get us very far into making AI systems do a lot of valuable work for us with very low risk.
I imagine that you're using a more specific definition of it than I am here.
I was thinking more of internal systems that a company would have enough faith in to deploy (a 1% chance of severe failure is pretty terrible!) or customer-facing things that would piss off customers more than scare them.
Getting these right is tremendously hard. Lots of companies are trying and mostly failing right now. There's a ton of money in just "making solid services/products that work with high reliability."
Social media companies have very successfully deployed and protected their black box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
I agree that some companies do use RL systems. However, I'd expect that most of the time, the black-box nature of some of these systems is not actively preferred. They use them despite the black-box nature, because these are specific situations where the benefits outweigh the costs, not because of them.
"current transformer models are arguably black boxes with massive adoption." -> They're typically much less that of RL. There's a fair bit of customization that can be done with prompting, and the prompting is generally English-readable.
Your example of "Everything Inc" is also similar to what I'm expecting. As in, I agree with:
1. The large majority of business strategy/decisions/implementation can (somewhat) quickly be done by AI systems.
2. There will be strong pressures to improve AI systems, due to (1).
That said, I'd expect:
1. The benefits are likely to be (more) distributed. Many companies will be simultaneously using AI to improve their standings. This leads to a world where there's not a ton of marginal low-hanging-fruit for any single company. I think this is broadly what's happening now.
2. A great deal of work will go into making many of these systems reliable, predictable, corrigible, legally-compliant, etc. I'd expect companies to really dislike being blind-sighted by sub-AI systems that do bizarre things.
3. This is a longer-shot, but I think there's a lot of potential for strong cooperation between companies, organizations, and (effective) governments. A lot of the negatives of maximizing businesses comes from negative externalities and similar, which can also be looked at as coordination/governance failures. I'd naively expect this to mean that if power is distributed among multiple capable entities at time T, then these entities would likely wind up doing a lot of positive-sum interactions with each other. This seems good for many S&P 500 holders.
"or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable."
This seems tough to me, but quite possible, especially as we get much stronger AI systems. I'd expect that we could (with a lot of work) have a great deal of:
1. Categorization of potential tasks into discrete/categorizable items.
2. Simulated environments that are realistic enough.
3. Innovations in finding good trade-offs between task competence and narrowness.
4. LLM task eval setups would get substantially more sophisticated and powerful.
I'd expect this to be a lot of work. But at the same time, I'd expect a lot of of it to be strongly commercially useful.
Thanks so much for that explanation. I've started to review those posts you linked to and will continue doing so later. Kudos for clearly outlining your positions, that's a lot of content.
> "We probably mostly disagree because you’re expecting LLMs forever and I’m not."
I agree that RL systems like AlphaZero are very scary. Personally, I was a bit more worried about AI alignment a few years ago, when this seemed like the dominant paradigm.
I wouldn't say that I "expect LLMs forever", but I would say that if/when they are replaced, I think it's more likely than not that they will be replaced by a system of a scariness factor that's similar to LLMs or less. The main reason being is that I think there's a very large correlation between "not being scary" and "being commercially viable", so I expect a lot of pressure for non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It's definitely possible that in the future some capabilities advancement might mean that scary systems have such a intelligence/capabilities advantage that this outweighs the disadvantages, but I see this as unlikely (though definitely a thing to worry about).
> I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level?
I'm referring to scaffolding. As in, an organization makes an "AI agent" but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks. These subcalls might be optimized to be narrow + [low information] + [low access] + [generally friendly to humans] or similar. This can be made more advanced with a large variety of fine-tuned models, but that might be unlikely.
"Do you think AI-empowered people / companies / governments also won't become more like scary maximizers?" -> My statements above were very focused on AI architectures / accident risk. I see people / misuse risk as a fairly distinct challenge/discussion.
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I'm having a hard time understanding the "Alas, the power-seeking ruthless consequentialist AIs are still coming” intuition. It seems like a lot of people in this community have this intuition, and I feel very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're doing now, and I see some strong reasons to expect this to continue. (I realize this isn't guaranteed, but I do think it's pretty likely)
3. While the human analogies are interesting, I assume they might appeal more to the "consequentialist AIs are still coming” crowd than people like myself. Humans were evolved for some pretty wacky reasons, and have a large number of serious failure modes. Perhaps they're much better than some of what people imagine, but I suspect that we can make AI systems that have much more rigorous safety properties in the future. I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative, for these challenges.
4. You mention human intrinsic motivations as a useful factor. I'd flag that in a competent and complex AI architecture, I'd expect that many subcomponents would have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it's really specific sub-routines and similar that have these more altruistic motivations.
I've been working on an app for some parts of this. Plan to more formally announce it soon, but the basics might be simple enough. Eager to get takes. Happy to add any workflows if people have requests. (You can also play with adding "custom workflows", or just download the code and edit it).
Happy to discuss if that could be interesting.
https://www.roastmypost.org
The humans trusted to make decisions.
I’m hesitant to say “best humans”, because who knows how many smart people there may be out there who might luck out or something.
But “the people making decisions on this, including in key EA orgs/spending” is a much more understandable bar.