Disclaimer: this article was translated from Russian using GPT. It has been partially proofread, but my English is weak, so occasional translation artifacts may remain.
Show me exactly where you break Soares' "4 background claims"; if you don't break them, it means you agree the assumption is correct. It makes sense to assume the premise could be correct because, all other things being equal, if I'm highly uncertain, an error in one direction costs a lot of effort, while an error in the other costs everything we've got.
(my friend Slava to me, in response to my skepticism)
I've already irreversibly invested in reading safety articles, so let my role be to deeply doubt the initial premises behind these expectations. If they are so strongly tied to reality, they won't break under skepticism, right?
In this essay I will attack Soares' "4 background claims": question them and show why, even where I agree with some of them, my expectations of the future differ from Nate's, or why my level of uncertainty is higher.
- If you don’t break them, it means you agree that the assumption is correct.
Let’s take a look at these 4 premises—do I agree with them as predictions?
We call this ability “intelligence,” or “general intelligence.” This isn’t a formal definition — if we knew exactly what general intelligence was, we’d be better able to program it into a computer — but we do think that there’s a real phenomenon of general intelligence that we cannot yet replicate in code.
I assume that by "general" this refers to the ability to solve a wide variety of tasks. Is this premise sufficient for conclusions about the existential threat of AI? For me, not yet.
This premise was just as valid 1,000 years ago in the same formulation, yet existential risk was lower (there were no nuclear weapons, no presumed AGI threat, no other risks associated with advanced technologies).
How can such a statement generate inaccurate predictions? When it is generalized as: "Since the ability to achieve goals in many areas exists, then in this particular area, where progress with frightening consequences has not yet been achieved, it could be achieved, and soon."
People have systematically erred in their forecasts when observing a sudden leap in progress within one field and generalizing it to all other areas. In rarer cases the error went the other way: they were overly confident that a sharp leap did not signal further acceleration of progress in that field soon after. The assessment depends on too many parameters, which people usually fail to account for in full. Specialists in the field may hold those parameters, but there are hundreds of examples where groups of specialists from different countries, despite having the most accurate models of a complex area, failed in their predictions because of hasty generalizations: intuitions were transferred from one area to another even though some of the transfer mechanisms did not match.
If some key part of this general intelligence was able to evolve in the few million years since our common ancestor with chimpanzees lived, this suggests there may exist a relatively short list of key insights that would allow human engineers to build powerful generally intelligent AI systems.
Maybe. Or maybe not. It's possible that a "universal AI" (for which no one has an exact model) might not be achievable in a way that aligns with generalized intuitions about it. Recalling a rough quote from some alignment article, "Show me at least one detailed scenario where we survive," I feel irritated, because I detect the implicit assumption that "we die," and I would rather the request were "show me at least one detailed scenario where we die."
If I postulate that an invisible dragon will appear in my garage in a year, I could start arguing with skeptics by saying, “Show me a detailed scenario in which the invisible dragon does NOT appear in my garage in a year.” They respond with something, but it’s not even a 10-page essay, and I claim the scenario isn’t detailed enough to convince me.
Alternatively, someone could press me to provide a detailed, likely scenario of the invisible dragon appearing in my garage in a year. I’m unlikely to write a 10-page essay with a detailed scenario, and even if I do, I suspect it would reveal many false assumptions and other erroneous predictions.
Thus, when I see a generalized statement like “X has many abilities to do many things in various domains,” it doesn’t shift my confidence that this X will lead the world in the near future into a highly specific, unlikely state (completely destroying humanity within a few years is a much more complex task than not destroying it).
Often, when you don't have an exact model of a process but make System 1 (in Kahneman's sense) generalizations from past cases to guess an outcome, the guessing works because some past patterns align and the generalization succeeds. However, when it comes to AGI, where some patterns have never been repeated before, the generalization might fail.
And I dislike the comfort in statements built on assumptions like "we will definitely die, and I don't have a single exact scenario," because believing such a thing once opens the door to overconfidence in other similar scenarios in the future. This belief also influences many motivations, just as the accuracy of one's convictions depends on how comfortable or uncomfortable such predictions feel.
"Humans have a highly universal ability to solve problems and achieve goals across various domains."
The phrase "a universal ability to solve problems" might create expectations that nearly all conceivable problems can be solved and that there’s no clear boundary where this ceases to be true. The author might interpret it this way, using it as a plug when trying to convince someone of a vague prediction, searching for arguments about why success is likely. Here, the umbrella-like nature of the "universal" cluster comes into play, creating an illusion of the broad applicability of human intelligence and its tools to a wide range of tasks—including tasks we’re not even aware of yet.
If your goal is to instill fear, you might optimize for the generalizability of these capabilities and outcomes. The broader they seem, the greater the uncertainty, and fear often increases with uncertainty about outcomes. However, the concept of universality loses sensitivity to specific scenarios, allowing the word "universal" to imply vague predictions like "it will somehow kill everyone" or "a complex, unlikely scenario will somehow occur."
Suggested revision: Replace the phrase with something like:
"Humans have the ability to solve many different problems, more than other animals, but this ability is limited by the laws of physics and the environment."
If you observe the consequences of human cognitive systems without understanding the details of how they work, you might lump them all into one bucket, label it with a specific word, say "intelligence," and then recall the achievements of this mechanism (building this, solving that). From there, it's easy to start accumulating expectations about a "vague mechanism related to computation with words and images." You might then combine this with remembered results from the past and use generalizations to predict the future.
But these generalizations, and the use of the same word, "intelligence," for "similar" phenomena, can lead to false predictions.
I am worried that, despite the professed "lack of formalization" and the absence of a fixed list of expectations (and of the domains of those expectations), Nate (in other articles) allows himself bold predictions with strong signaling of confidence amid such great uncertainty.
***
Pay attention to how your vague expectations (connected to something nonspecific yet intrinsically unpleasant when "felt" internally) may become insensitive to the environmental constraints that limit your prediction. The vaguer your scenario and the greater the uncertainty, the more I expect people to succumb to this tendency. When unpleasant feelings arise from modeling something vague and undesirable, there is a temptation to agree with the conclusion that this unpleasant outcome is certain (to escape the uncertainty, which is physiologically uncomfortable).
Example: I noticed in myself that it is physiologically unpleasant to imagine crossing on a red light, even when I can see that there are no cars for 400 meters to the left or right. Observing this mechanism, I could verbalize it as "I don't want to cross on red; something terrible will happen." But when I asked myself what exactly would be terrible, and how, given that there are no cars, my analytical part would say, "I don't know how. Yes, there are no cars, and 100% none will appear within 15 seconds." Yet it is still scary, out of habit. If I didn't have a block on the word "terrible," that is what I would have called it.
That is, physiologically driven fear and high confidence are present even in a precision-obsessed rationalist, and physiology wins out even there.
What can I say, then, about situations where you DO NOT KNOW whether a car will come around the corner, whether it will be something else entirely, and where there is a ton of other uncertainty on top? Fear, by the same pattern that stopped me from crossing the road, tempts you even more strongly to neglect the precision of models of "how exactly something will happen" and to settle for the bottom line, "something bad and possibly fatal," because that bottom line feels more comfortable than the uncertainty.
Researchers at MIRI tend to lack strong beliefs about when smarter-than-human machine intelligence will be developed. We do, however, expect that (a) human-equivalent machine intelligence will eventually be developed (likely within a century, barring catastrophe); and (b) machines can become significantly more intelligent than any human.
Humans use their intelligence to create tools and plans and technology that allow them to shape their environments to their will (and fill them with refrigerators, and cars, and cities). We expect that systems which are even more intelligent would have even more ability to shape their surroundings, and thus, smarter-than-human AI systems could wind up with significantly more control over the future than humans have.
P.S.
"The argument about the importance of artificial intelligence rests on these four statements"
Important for what purposes? In a vacuum again? For most purposes, guess which ones? Should I replace the word "important" here with "important for survival or for pleasant consequences, which are supposedly impossible without these four conditions being met"? Should I read it as important to Nate Soares, in the sense that the word "important" is linked to his stress about the absence of something related to these premises? On my map, the word "important" almost always points to feelings and preferences. People abuse this word via the mind projection fallacy, turning "important" into a seemingly stable property of an object, so that you consume this apparent property regardless of who said the word; an attitude is thereby transferred from one person to another, and since the property seems stable, you acquire stress about losing this thinking habit, which in turn gives you motivation to preserve the property. That is, to experience anxiety about the absence of importance.
Since Nate doesn't decode the word "important" here, I'll have to decode it in my own way, and I'll decode it like this: the argument about the importance of artificial intelligence, I suppose, means that Nate wants to transfer to you his emotional attitude toward the things he is saying. The word "important," I expect, is always connected with stress about the absence of the important thing. Even if it comes from a place of happiness, in this essay I expect Nate's task is to add stress to you about what he wrote, and calm about his arguments. I prefer to remain uncertain about that stress, moving slightly in the opposite direction based on the current evidence, and will be very skeptical of Nate's current arguments, creating artificial anxiety about them for myself so that I have motivation to look for counter-theses and do not end up a believer in a highly uncertain yet unambiguous outcome.