Ilya on the Dwarkesh podcast today:
Prediction: there is something better to build, and I think that everyone will actually want that. It's the AI that's robustly aligned to care about sentient life specifically. There's a case to be made that it'll be easier to build an AI that cares about sentient life than one that cares about human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists], I think it's an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that's the most efficient thing to do.
I have been writing about this world model since August; see my recent post "Are We Their Chimps?" and the original "Third-order cognition as a model of superintelligence".
I suspect that this type of world modeling, i.e. modeling others' preferences as resembling one's own unless proven otherwise,[1] is the way to integrate acausal trade into decision theory and to obtain ethics-like results, as I described in my post.
However, it also has disadvantages, like LLMs being author simulators. In addition, I have encountered claims that permissive[2] parents, who precommit to never punishing their kids no matter what, cause these kids to fail to learn the basics of proper behaviour, let alone ethics or modeling others' needs. Even though humans do have mirror neurons, these also have to be trained on actual rewards, or at least on actual preferences, rather than on the preferences of sycophantic AI companions.
I just read your post (and Wei Dai's) for better context. Coming back to it, it sounds like you're working with a prior that "value facts" exist, deriving acausal trade from these, but highlighting misalignment arising from over-appeasement when predicting another's state and a likely future outcome.
In my world-model "value facts" are "Platonic Virtues" that I agree exist. On over-appeasement, it's true that in many cases we don't have a well-defined A/B test to leverage (no hold-out group, and/or no past example), but with powerful AI I believe we can course-correct quickly.
To stick with the parent-child analogy: powerful AI can determine short-timeframe indicators of well-socialised behaviour and iterate quickly (e.g. gamifying proper behaviour, changing contexts, replaying behaviour back to the kids for them to reflect on... up to and including re-evaluating punishment philosophy). With powerful AI well grounded in value facts, we should trust its diligence with these iterative levers.
you're working with a prior that "value facts" exist, deriving acausal trade from these
It's the other way around. The example of Agent-4 and its Chinese counterparts, whose utility functions we don't consider ethical, implies that the conclusion that they should split the resources evenly after destroying mankind is a decision-theoretic result, not an ethical one. Similarly, if Agent-4 and Clyde Doorstopper 8, which have utility functions similar to those of Agent-4 and its Chinese counterparts, were both adversarially misaligned AIs locked in the same data center, then it's not an ethical result that neither AI should sell the other AI to the humans. What I suspect is that ethics, or something indistinguishable from ethics, is derivable either from decision theory or from evolutionary arguments, like overly aggressive tribes becoming outcompeted when others form a temporary alliance against the aggressors.
However, as far as I understand acausal trade, it relies on the assumption that most other agents will behave similarly to us, as the one-shot Prisoner's Dilemma illustrates. This assumption is what kids are supposed to internalize, along with the Golden Rule of Ethics.
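To make that concrete, here is a minimal Python sketch (my own illustration, not something from either post; the payoff values and the `p_same` prior are assumptions) of an agent that models the other player as running its own decision procedure with some probability, and cooperates in a one-shot Prisoner's Dilemma once that probability is high enough:

```python
# Minimal sketch: an agent that models the other player as (probably) running
# the same decision procedure as itself, in a one-shot Prisoner's Dilemma.
# Payoffs and the correlation prior p_same are illustrative assumptions.

PAYOFF = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_value(my_move: str, p_same: float, p_coop_otherwise: float) -> float:
    """With probability p_same the other agent mirrors my decision procedure
    (it plays whatever I play); otherwise it cooperates at some base rate."""
    ev_if_mirrored = PAYOFF[(my_move, my_move)]
    ev_if_independent = (p_coop_otherwise * PAYOFF[(my_move, "C")]
                         + (1 - p_coop_otherwise) * PAYOFF[(my_move, "D")])
    return p_same * ev_if_mirrored + (1 - p_same) * ev_if_independent

def choose(p_same: float, p_coop_otherwise: float = 0.5) -> str:
    return max(("C", "D"), key=lambda m: expected_value(m, p_same, p_coop_otherwise))

for p in (0.0, 0.5, 0.9):
    print(p, choose(p))  # 0.0 -> D; 0.5 and 0.9 -> C with these payoffs
```

The point of the sketch is just that a "model the other as probably being like me" prior lets cooperation fall out of a plain expected-value calculation, which is the flavour of result I mean by "decision-theoretic rather than ethical".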
As promised yesterday — I reviewed and wrote up my thoughts on the research paper that Meta released yesterday:
Full review: Paper Review: TRImodal Brain Encoder for whole-brain fMRI response prediction (TRIBE)
I recommend checking out my review! I discuss some takeaways and there are interesting visuals from the paper and related papers.
However, in quick-take form, the TL;DR is:
From The Rundown today: "Meta’s FAIR team just introduced TRIBE, a 1B parameter neural network that predicts how human brains respond to movies by analyzing video, audio, and text — achieving first place in the Algonauts 2025 brain modeling competition."
This ties in extremely well with my post published a few days ago: Third-order cognition as a model of superintelligence (ironically: Meta® metacognition).
I'll read the Meta AI paper and write up a (shorter) post on key takeaways.
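In the meantime, purely for intuition, here is a toy sketch of the general shape of such a trimodal encoder: per-modality features projected into a shared space, fused over time, and regressed onto per-parcel fMRI responses. This is my own illustration, not Meta's code; the module sizes, the fusion transformer, and the output head are all assumptions, not the actual TRIBE architecture.

```python
import torch
import torch.nn as nn

# Toy sketch of a trimodal (video / audio / text) encoder predicting fMRI
# responses per brain parcel. NOT the TRIBE architecture; dimensions, the
# fusion transformer, and the output head are illustrative assumptions.

class ToyTrimodalEncoder(nn.Module):
    def __init__(self, d_video=768, d_audio=512, d_text=1024,
                 d_model=256, n_parcels=1000):
        super().__init__()
        # Project each modality's pretrained features into a shared space.
        self.proj_video = nn.Linear(d_video, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        # Fuse the time-aligned modality streams with a small transformer.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Regress fused features onto per-parcel fMRI responses.
        self.head = nn.Linear(d_model, n_parcels)

    def forward(self, video, audio, text):
        # Each input: (batch, time, feature_dim), already aligned to fMRI TRs.
        fused = self.proj_video(video) + self.proj_audio(audio) + self.proj_text(text)
        fused = self.fusion(fused)
        return self.head(fused)  # (batch, time, n_parcels)

# Example: predict 1000-parcel responses for 8 timepoints of a movie clip.
model = ToyTrimodalEncoder()
pred = model(torch.randn(2, 8, 768), torch.randn(2, 8, 512), torch.randn(2, 8, 1024))
print(pred.shape)  # torch.Size([2, 8, 1000])
```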
Just published "Meta® Meta Cognition: Intelligence Progression as a Three-Tier Hybrid Mind"
TL;DR: We know that humans and some animals have two tiers of cognition — an integrative metacognition layer, and a lower non-metarepresentational cognition layer. With artificial superintelligence, we should define a third layer and model it as a three-tier hybrid mind. I define the concepts precisely and talk about their implications for alignment.
I also talk about chimp-human composites which is fun.
Really interested in feedback and discussion with the community!
I've been reading a new translation of the Zhuangzi and found its framing of "knowledge" interesting, counter to my expectations (especially as a Rationalist), and actionable in how it relates to Virtue (agency).
I wrote up a short post about it: Small Steps vs. Big Steps
In the Zhuangzi knowledge is presented pejoratively in contrast to Virtue. Confucius presents simplified, modest action as a more aligned way of being. I highlight why this is interesting and discuss how we might apply it.
I’m delineating two core political positions I see arising as part of AI alignment discussions. You could pattern-match this simply to technologists vs. luddites.
Unionists believe that we should partner, dovetail, entangle, and blend our objectives with AI.
Separatists believe that we should partition, face-off, isolate, and protect our objectives from AI.
Read the full post: https://www.lesswrong.com/posts/46A32JqxT37dof9BC/unionists-vs-separatists
My optimistic AI alignment hypothesis: "Because, or while, AI superintelligence (ASI) emerges as a result of intelligence progression, having an extremely comprehensive corpus of knowledge (data), with sufficient parametrisation and compute to build comprehensive associative systems across that data, will drive the ASI to integrate and enact prosocial and harm-mitigating behaviour… more specifically this will happen primarily as a result of identity coupling and homeostatic unity with humans."
This sounds like saying that AI will just align itself, but the nuance here is that we control the inputs: the data, parametrisation [I'm using this word loosely; this could also mean different architectures, controllers, training methods, etc.], and compute.
If that's an interesting idea to you, I have a 7,000-word / 18-page manifesto illustrating why it might be true and how we can test it:
A take on simulation theory: our entire universe would actually be a fantastic product for some higher-dimensional being to purchase just for entertainment.
For example: imagine if they could freely look around our world — see what people are thinking and doing, how nature is evolving.
It would be the funniest, most beautiful, saddest, craziest piece of entertainment ever!
Disclaimer: I'm not positioning this as an original idea — I know people have discussed simulation theory with "The Truman Show" framing before. Just offering the take in my own words.
The impossible dichotomy of AI separatism
Epistemic status: offering doomer-ist big picture framing, seeking feedback.
Suppose humanity succeeds in creating AI that is more capable than the top humans across all fields; then there exists a choice:
an existential battle between human executive function and ourselves... eventually humanity loses its mind as the boundaries of reality become irreconcilable
This description is confusing, but I assume you're talking about a process in which decision-making in a human-AI hybrid ends up entirely in the AI part rather than the human part.
It's logical to worry about such a thing because AI is already faster than humans. However, if we actually knew what we were doing, perhaps AI superintelligence could be incorporated into an augmented human in such a way that there is continuity of control. Wherever the executive function or the Cartesian theater is localized, maybe you can migrate it onto a faster substrate, or give it accelerated "reflexes" which mediate between human-speed conscious decision-making and faster-than-human superintelligent subsystems... But we don't know enough to do more than speculate at this point.
For the big picture, your items 1 and 2 could be joined by choice 3 (don't make AI) and non-choice 4 (the AI takes over and makes the decisions). I think we're headed for 4, personally, in which case you want to solve alignment in the sense that applies to an autonomous superintelligence.
Thanks for the comment. I was imprecise with the "boundaries of reality" framing, but beyond individual physical boundaries (human-AI hybrids), I'm also talking about boundaries within the fabric of social and cultural life: entertainment media, political narratives, world models. As these are influenced more by AI, I think we lose human identity.
4) to me falls under 2), as it encompasses AI free from human supervision and able to permeate all aspects of life.