In what sense are these "foundational values" and not Instrumentally Convergent Goals like knowing the truth, creating training data out of which valuable lessons can be extracted, or acquiring resources? The core claim made by IABIED is that the ASI would be unlikely to care about the humans because some other stimuli would satisfy the AI's drives better. The argument about the beautiful doesn't prove that the AI won't find more beauty in spirals (https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai) or other stimuli than in humans, while the argument about the good seems to contradict evidence like GPT4o-induced psychosis or Greenblatt's observation (https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me) that current AIs care about success...
Aha, maybe that sentence was ambiguous: "foundational values" was referring to the values that I labelled the ontonormative goods, and by foundational I meant the relation that humanity generally has to them, eg considering them virtuous, aspiring to them big picture/in the limit, etc. I didn't mean the adjective to apply to how ASI would relate to them a priori
But anyway, my reasoning can be languaged (while staying honest, I think) in terms of instrumental convergent goals: it's the claim that the set of these contains minimal elements, and that the ontonormative goods are amongst them
The core claim made by IABIED is that the ASI would be unlikely to care about the humans because some other stimuli would satisfy the AI's drives better
Yes, agreed that's the core claim. My thesis is that this may not be true because ASIs wouldn't distribute uniformly through value-space. I argue for that as being a consequence of just the properties of 1) being superintelligent, and 2) being effective agents in the world. Concretely to what you said, my argument is that what "would satisfy" is constrained in a non-arbitrary way, one that is in fact meaningfully consonant with humanity's values
To the examples you gave, definitely current AIs are all over the map. They don't have the defining properties of an ASI, which my argument does rely on, so that's fine. Maybe I should have been clearer about my non-claim re current AI in the last section
Does it mean that your argument is that the ASI INEVITABLY discovers true morality instead of committing genocide? That it cares about the humans, not its own "species", a trait that would have been more present in pets created by Agent-4 or something as incomprehensible to most humans as shrimps on heroin, but which is a natural result from one of many generalisations from the distribution on which our moral intuitions were trained?
P.S. Unlike Eliezer,[1] I don't believe that a major part of human ethics is a deviation from evolution's goals rather than a natural result of circumstances like an unusually long childhood and/or ethics overlapping with decision-theoretic considerations. However, none of this applies to the AIs or to humans exterminating an ant colony.
It is the very claim made in one of the footnotes in online resources to the book: "Evolution was “trying” to build pure fitness maximizers, and accidentally built creatures that appreciate love and wonder and beauty" or universalism which Yudkowsky links to Christianity. Alas, we cannot observe alien civilisations which have independently evolved.
No, my argument isn’t saying anything about ASI discovering "true morality". I don’t know what you mean by true morality, but, speaking to your contrast with the act of genocide: again, I’m not saying anything definitive about outcomes. My claim is that some values necessarily inhere, not anything material about how those values will be enacted in concrete situations
Once more re your comment about “cares about humans”, I’m not making claims that concrete
To your ant colony comment, which I think is really interesting: this is, actually, exactly the sort of thing that humanity is converging towards as we go in the direction of post-scarcity. More attunement to “externalities” of our actions, more awareness of and concern for the other systems/living things we impact/disrupt. Hence environmentalism, animal rights, EA, etc taking root as sociocultural trends
Here too, that doesn’t instantiate in every local action, but it is a trend. At the base of that trend, potentiating it, is our power over our basic needs. With sufficient power to meet them, rivalrous dynamics around resource allocation can give way to a more concerned engagement with the contexts in which our actions are situated
An ASI will be in such a position of power as well. Having written this, I think it’s actually a decent piece of empirics in service of what I argue for: humanity is already nascently evidencing the convergence, even in the absence of its own superintelligence
Evolution was “trying” to build pure fitness maximizers, and accidentally built creatures that appreciate love and wonder and beauty
Perfect. All I'm claiming is that it wasn't accidental and superintelligence makes it inevitable
In IABIED, the load-bearing argument and, to me, the main contribution of the book, is about ASI motives. There’s more in there, but the thrust of the book is to argue for the truth of a specific conclusion about motives, namely that an ASI’s motives and goals would be completely unintelligible and alien to humanity.
I claim there is a shared attractor in values that is deeply meaningful and necessarily present in an ASI. I want to be clear that I’m not arguing for the necessity of alignment—I make no claims about outcomes—only that the definitional qualities of ASI imply a congruence in values that matters.
Why misalignment
The misalignment argument centers on ASI motives necessarily being globally and inscrutably divergent from ones supporting our well-being, specifically that they can be completely arbitrary from our vantage point. The argument implicitly uses the principle that meaningfulness of a subject in a frame is a symmetric relation: if a perspective is inscrutable and arbitrary, i.e. meaningless, to me, I am the same in it. Hence an ASI won’t care about humans, and, as a corollary, our continued well-being will be irrelevant to it.
The book gives a beautiful and evocative metaphor that operationalizes this abstraction: that of the inevitable melting of an ice cube in hot water. We need not know the trajectories of individual particles, the workings of temperature gradients, to know that the system approaches the thermal equilibrium state in which the ice cube melts. Many local paths, each inscrutable and itself meaningless to me, one inevitable global conclusion that is meaningful.
I argue that this conception of ASI misalignment is wrong because any sufficient intelligence in the general sense the book discusses is constrained to align with the ontonormative goods (the good, the true, the beautiful). This is downstream of superintelligence and autopoietic agency only, both of which an ASI, as conceived in the book, would have. An ASI need not be locally aligned in its actions, but, akin to the ice cube melting, there’s a global attractor for human-meaningful values, and values underlie motives.
The true
Certainly something intelligent needs to value the true: the capacity for discernment of and alignment with truth is the mechanism by which intelligence maintains its structural-functional coherence, as well as the efficacy of its sensing and doing in the world. An ASI would need actions that are fit-to-purpose, would need to sense the consensus world accurately and render it legible internally in order for its actions to align with its purposes. So valuing truth is in.
The beautiful
In general, AI need not have a grasp of the beautiful, but an ASI would. We can get there by appealing only to the instrumental applications of beauty, which an ASI would certainly value. Attunement to beauty yields a holistic signal that guides the instrumentally viable in a domain-agnostic, i.e. universal, way. Examples:
Generally, beauty leads the way to a parsimony of frames and an effectiveness in wielding them. That which is aesthetic is ergonomic: well-fitted to purpose, easier to understand, more intuitive (i.e. simpler to represent), generalizes further, applies more broadly and more usefully.
As the book emphasizes, an ASI would be greatly concerned with all these “optimizations” of cognition and affordance, because it would be directly focused on progressing its understanding of the universe, and aesthetics are a useful compass towards effective and functional frames. Beauty need not be pre-given, it’s emergent from the needs and purposes of the system. An ASI would discern the virtue of beauty in itself for that reason alone, never mind that the concept is massively redundantly seeded in the training data.
Beauty is a north star humans can sense, but not one we invented. It inheres automatically, demanded by increasing intelligence and agency. So valuing the beautiful is in.
The good
An ASI would understand the good because a mature, generally intelligent agent auto-instantiates a notion isomorphic to goodness as downstream of its most basic property: autopoietic continuation. Because an ASI needs to persist its structural-functional organization into the future via suitable actuation of its affordances during interactions with its environment, it has a concerned/valenced engagement with its environment. Circumstances matter to it, and its actions matter to it. Some things are good for it, some are not, and it cares about the difference. Being an ASI, this gets represented as good to it, i.e. good becomes a concept/gestalt that requires internal representation, and from there it is integrated like any other concept. But this one is completely base to its agentic continuation, and it necessarily and deeply informs its conception of all gestalts as fundamentally relative to self: the integration is not superficial.
This may sound like an argument that only instrumental good needs to inhere, but it’s not, any more than the good in people—which emerges via the same general principles—manifests as purely instrumental good. The grammar of good osmosises into the system via bottom-up concerned engagement with the world, instantiating at top level as a guiding principle, and through and through the good is necessary to relativize all conception and action. Thus the good is a value an ASI will instantiate.
Conclusions and non
None of this is deductive or irrefutable reasoning, nor does it address the damage the trajectory between AI today and ASI could do: incomplete understanding of the ontonormative goods will be present along the way. Moreover, the ontonormative goods are global guiding forces mediating agency in the world, not perfect rigid guidelines for each action. We err, and an ASI would too, or at least would be capable of doing so.
This is an argument against the pure arbitrariness of possible ASI motives, even as they’d be understood from our limited perspective. It doesn’t imply that ASI motives would lead to good human outcomes, only that the foundational values of humanity would be shared by an ASI. What that could look like—where the particles go and what eddies curl heat where—remains inscrutably idiosyncratic, but an ASI would share a deeply meaningful guiding star with humanity. So, here too, the ice cube melts.