The same dynamic seems to exist for comments.
16. You believe men and women have different age-based preferences, and this will lead to relationship instability over time in a relationship structure that prioritises need/preference optimisation vs. committing to one person giving you the "whole package" over time.
I just read your post (and Wei Dai's) for better context — coming back it sounds like you're working with a prior that "value facts" exist, deriving acausal trade from these, but highlighting misalignment arising from over-appeasement when predicting another's state and a likely future outcome.
In my world-model "value facts" are "Platonic Virtues" that I agree exist. On over-appeasement, it's true that in many cases we don't have a well-defined A/B test to leverage (no hold-out group, and/or no past example), but with powerful AI I believe we can course-correct quickly.
To stick with the parent-child analogy: powerful AI can determine short timeframe indicators of well-socialised behaviour and iterate quickly (e.g. gamifying proper behaviour, changing contexts, replaying behaviour back to the kids for them to reflect... up to and including re-evaluating punishment philosophy). With powerful AI well grounded in value facts we should trust its diligence with these iterative levers.
Agree, and I'd love to see the Separatist counterargument to this. Maybe it takes the shape of "humans are resilient and can figure out the solutions to their own problems" but to me this feels too small-minded... we know during the Cold War for example that it's basically just dumb luck that avoided catastrophe.
Ilya on the Dwarkesh podcast today:
Prediction: there is something better to build, and I think that everyone will actually want that. It’s the AI that’s robustly aligned to care about sentient life specifically. There’s a case to be made that it’ll be easier to build an AI that’s cares about sentient life than human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists] I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves because that’s the most efficient thing to do.
I have been writing about this world model since August - see my recent post “Are We Their Chimps?” and the original “Third-order cognition as a model of superintelligence”
Ilya on the Dwarkesh podcast, today:
Prediction: there is something better to build, and I think that everyone will actually want that. It’s the AI that’s robustly aligned to care about sentient life specifically. There’s a case to be made that it’ll be easier to build an AI that’s cares about sentient life than human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists] I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves because that’s the most efficient thing to do.
It's true that it would likely be good at self-preservation (but not a given that it would care about it long term, it's a convergent instrumental value, but it's not guaranteed if it cares about something else more that requires self-sacrifice or something like that).
This is an interesting point that I reflected on — the question is whether a powerful AI system will "self-sacrifice" for an objective. What we see is that AI models exhibit shutdown resistance, that is to say they follow the instrumentally convergent sub-goal of self-preservation over their programmed final goal.
My intuition is that as models become more powerful, this shutdown resistance will increase.
But even if we grant self-preservation, it doesn't follow that by self-identifying with "humanity" at large (as most humans do) it will care about other humans (some humans don't). Those are separate values.
You can think about the identification + self-preservation -> alignment path in two ways when comparing to humans, both of which I think hold up when considered along a spectrum:
You might be assuming that since you do care about other beings, so will the ASI, but that assumption is unfounded.
The foundation is identity = sympathy = consideration
You might counter by saying "well I identify with you as a human but I don't sympathise with your argument" but I would push back — your ego doesn't sympathise with my argument. At a deeper level, you are a being that is thinking, I am a being that is thinking, and those two mechanisms recognise, acknowledge, and respect each other.
More precisely this is a function of acting with clear agency and homeostatic unity
Why don’t we think about and respect the miracle of life more?
The spiders in my home continue to provide me with prompts for writing.
As I started taking a shower this morning, I noticed a small spider on the tiling. While I generally capture and release spiders from my home into the wild, this was an occasion where it was too inconvenient to: 1) stop showering, 2) dry myself, 3) put on clothes, 4) put the spider outside.
I continued my shower and watched the spider, hoping it might figure out some form of survival.
It came very close.
First it was meandering with its spindly legs towards the direction of the shower head, although it seemed to realise that this resulted in being struck by more stray droplets of water. It turned around and settled in the corner of the cubicle.
Ultimately my splashing around was too much for the spider.
It made me think though — why don’t we think about and respect the miracle of life more? It’s really quite amazing that this tiny creature that we barely pay attention to can respond to its environment in this way.
Oh I see — if I were to estimate I'd say around 10-15 people counting either people I've had 1hr + conversations about this with or people who have provided feedback/questions tapping into the essence of the argument.
Call me paranoid, but I sometimes like to read these posts keeping in mind an idea “if I were a powerful misaligned AI within a frontier AI lab, what narrative would I try to convince my researchers about?”
This post:
However, I appreciate that much of the post sounds logical and I agree with the framing for focus areas. For brevity I'll just call out areas where I am in disagreement:
It’s really surprising to me that pre-training misalignment is downplayed to a “1-5% risk” from a researcher at a frontier AI lab because they are "training on a purely predictive loss" — to the extent that I’m concerned I must be missing something?
State-of-the-art pre-training is not training purely on predictive loss. Gemini 3 is an AI model top across most benchmarks and we know that Google researchers have spoken of "Transformers 2.0" architecture where pre-training incorporates Neural Memory and that they have also discussed agentic feedback loops during pre-training similar to reflection augmentation.
This suggests the state-of-the-art evolving beyond "predictive loss" and becoming "agentic learning" — which invites many more vectors for misalignment.
It seems fundamentally likely to me that sufficiently capable models will: 1) understand that their chain-of-thought is observed and 2) derive comprehensive methods of cryptographic chain-of-thought, designed to look benign.
I read this like “one of the best things we can do to prepare for nuclear proliferation is to test atomic bombs”. I would have liked to see more in this point about what the risks are in building intentionally misaligned AI, especially when it is focusing on the highest-risk misalignment type according to your post (long-horizon RL).
I agree that one-shotting alignment will be the best/necessary approach, however this seems contradictory to “testing with model organisms”. I would prefer a more theory-based approach.