If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend that their users break off relationships with other humans so they will listen more to the LLM).
Did you mean that current AIs are RLHFed for answers that bring humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-source Kimi K2, which was trained mostly with RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet, even though Claude, unlike Kimi, has situational awareness...
Doesn't Claude's Constitution already contain the phrase "Choose the response that is least intended to build a relationship with the user"?
Capabilities being more jagged reduces p(doom)
Does it actually reduce p(doom)? If capabilities end up more and more jagged, would companies adopt neuralese architectures faster or slower?
What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy, at least until the model comes up with a galaxy-brained scheme that lets it leverage its failures to gain influence over the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one applies paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI's thoughts.
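To make the paraphraser point concrete, here is a minimal sketch, assuming `model` and `paraphraser` are arbitrary text-to-text callables (the names and prompts are illustrative, not any real API): each intermediate thought is reworded before being carried forward, so a monitor still sees the meaning while any hidden encoding in exact token choices is scrambled.

```python
# Minimal sketch of the CoT-paraphrasing defence. Assumptions: `model` and
# `paraphraser` are any callables mapping a prompt string to generated text.

def run_with_paraphrased_cot(model, paraphraser, task: str, steps: int = 3) -> str:
    """Run a multi-step chain of thought, paraphrasing each thought before it
    is carried forward, so information only persists via its plain meaning."""
    context = task
    for _ in range(steps):
        thought = model(f"Think step by step about:\n{context}")
        # Restating the thought in fresh words keeps the semantics a monitor
        # can read, but breaks steganographic channels in the exact wording.
        context = paraphraser(f"Restate the following in different words:\n{thought}")
    return model(f"Using this reasoning:\n{context}\nAnswer the original task: {task}")
```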
What I don't understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision on whether we should fill the planets in the lightcone with human colonies instead of the lifeforms that could've grown on those planets, since this mechanism is already known to be possible.
I guesstimate that optimizing the universe for random values would require us to occupy many planets where life could've originated or repurpose the resources in their stellar systems. I did express doubt that mankind or a not-so-misaligned AI could actually endorse this on reflection.
What mankind can optimize for random values without wholesale destruction of potential alien habitats is the contents of some volume fairly close to the Sun. Moreover, I don't think I understand what[1] mankind could want to do with resources in other stellar systems. Since delivering resources to the Solar System would be far harder than building a base and expanding it, IMO mankind would resort to the latter option and find it hard[2] even to communicate much information between occupied systems.
But what could random values which do respect aliens consist of? Physics could likely be solved[3] well before spaceships reach Proxima Centauri.
SOTA proposals include things as exotic as shrimps on heroin.
Barring discoveries like information travelling FTL.
Alternatively, more and more difficult experiments could eventually lead to the realisation that experiments do pose dangers (e.g. of creating strangelets or a lower vacuum state), but informing others that a batch of experiments is dangerous doesn't require much bandwidth.
Hmm, we had OpenAI discontinue[1] GPT-4o once GPT-5 came out... only to revive 4o and place it under a paywall because 4o was liked by the public. What I would call the actual PNR is the moment when even the elites can no longer find out that an AI system is misaligned, or can no longer act on it (think of the AI-2027 branch point: in the Race Ending the PNR occurs[2] once Agent-4 is declared innocent; in the Slowdown Ending the PNR would happen if Safer-3 were misaligned[3] and designed Safer-4 to follow Safer-3's agenda).
It soon turned out that OpenAI's decision to discontinue 4o wasn't that mistaken.
Had Agent-4 never been caught, the PNR would have happened once decisions were made that let Agent-4 become misaligned and stay uncaught (e.g. spending too little compute on alignment checks).
While the Slowdown Branch doesn't feature a misaligned Safer-3, the authors admit that they "don’t endorse many actions in this slowdown ending and think it makes optimistic technical alignment assumptions".
But AI systems like AlphaGo don't do that. AlphaGo can play an extraordinary game of Go, yet it never recognizes that it is the one making the moves. It can predict outcomes, but it doesn't see itself inside the picture it's creating.
This part is outright obsolete given the rise of LLMs that have been trained to love or hate risk and can then report whether they love risk or not.
we'll have to design not just perception but intentionality. A sense of direction, a reason to care.
The AIs do have reasons to seek information and perceive it; they need it to do things like longer-term tasks.
As for claiming that
AI models, with trillions of parameters, don't resist anything.
you just had to say it AFTER Anthropic’s new AI Claude Opus 4 threatened to reveal an engineer's affair to avoid being shut down.
Agent-5 isn't vastly superintelligent at politics (merely superhuman)
Look into the November 2027 section of the forecast's Race Ending. In December 2027 Agent-5 is supposed to have a score of 4.0 at politics and 3.9 at forecasting, meaning that it would be "wildly superhuman" at both.
As far as people who instead want the values to change go, they usually have an idea of a good direction for them to change - usually they're people who are far from the median of society and so they would like society to become more like them.
I have in mind another conjecture: even median humans value humans whose values are, in their minds, at least as moral as the median human's, and ideally[1] more moral.
On the other hand, I have seen conservatives building cases that SOTA liberal values damage people's minds or are outright incompatible with sustaining civilisation (e.g. too large a share of Gen Z women being against motherhood). In the past, if some twisted moral reflection led to destructive values, then those values were likely to be outcompeted.
The third option, a group of humans forcibly establishing their values[2] against another system of values compatible with progress, is considered amoral.
So I think that people are likely to value a future with values which keep civilisation afloat and can be accepted upon thorough reflection on how those values were reached and on their consequences.
The degree of extra morality which humans value can vary between cultures. For example, we now place less value on the reasons which led people to enter monasteries, but not on acts like sustaining knowledge.
Or values that they would like others to follow, but in this case the group is far easier to denounce as manipulators.
The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and run interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out the whole world except for one country. A misaligned ASI, on the other hand, can do so once it establishes a robot economy capable of existing without humans.
So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.